Version 1.0.4
February 2021
Understanding Distributed Systems by Roberto Vitillo
Copyright © Roberto Vitillo. All rights reserved.
The book’s diagrams have been created with Excalidraw.
While the author has used good faith efforts to ensure that the information and instructions in this work are accurate, the author disclaims all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. The use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Writing a book is an incredibly challenging but rewarding experience. I wanted to share what I have learned about distributed systems for a very long time.
I appreciate the colleagues who inspired and believed in me. Thanks to Chiara Roda, Andrea Dotti, Paolo Calafiura, Vladan Djeric, Mark Reid, Pawel Chodarcewicz, and Nuno Cerqueira.
Doug Warren, Vamis Xhagjika, Gaurav Narula, Alessio Placitelli, Kofi Sarfo, Stefania Vitillo and Alberto Sottile were all kind enough to provide invaluable feedback. Without them, the book wouldn’t be what it is today.
Finally, and above all, thanks to my family: Rachell and Leonardo. You always believed in me. That made all the difference.
According to Stack Overflow’s 2020 developer survey, the best-paid engineering roles require distributed systems expertise. That comes as no surprise as modern applications are distributed systems.
Learning to build distributed systems is hard, especially if they are large scale. It’s not that there is a lack of information out there. You can find academic papers, engineering blogs, and even books on the subject. The problem is that the available information is spread out all over the place, and if you were to put it on a spectrum from theory to practice, you would find a lot of material at the two ends, but not much in the middle.
When I first started learning about distributed systems, I spent hours connecting the missing dots between theory and practice. I was looking for an accessible and pragmatic introduction to guide me through the maze of information and set me on the path to becoming a practitioner. But there was nothing like that available.
That is why I decided to write a book to teach the fundamentals of distributed systems so that you don’t have to spend countless hours scratching your head to understand how everything fits together. This is the guide I wished existed when I first started out, and it’s based on my experience building large distributed systems that scale to millions of requests per second and billions of devices.
I plan to update the book regularly, which is why it has a version number. You can subscribe to receive updates from the book’s landing page. As no book is ever perfect, I’m always happy to receive feedback. So if you find an error, have an idea for improvement, or simply want to comment on something, always feel free to write me.
If you develop the back-end of web or mobile applications (or would like to!), this book is for you. When building distributed systems, you need to be familiar with the network stack, data consistency models, scalability and reliability patterns, and much more. Although you can build applications without knowing any of that, you will end up spending hours debugging and re-designing their architecture, learning lessons that you could have acquired in a much faster and less painful way. Even if you are an experienced engineer, this book will help you fill gaps in your knowledge that will make you a better practitioner and system architect.
The book also makes for a great study companion for a system design interview if you want to land a job at a company that runs large-scale distributed systems, like Amazon, Google, Facebook, or Microsoft. If you are interviewing for a senior role, you are expected to be able to design complex networked services and dive deep into any vertical. You can be a world champion at balancing trees, but if you fail the design round, you are out. And if you just meet the bar, don’t be surprised when your offer is well below what you expected, even if you aced everything else.
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
– Leslie Lamport
Loosely speaking, a distributed system is composed of nodes that cooperate to achieve some task by exchanging messages over communication links. A node can generically refer to a physical machine (e.g., a phone) or a software process (e.g., a browser).
Why do we bother building distributed systems in the first place?
Some applications are inherently distributed. For example, the web is a distributed system you are very familiar with. You access it with a browser, which runs on your phone, tablet, desktop, or Xbox. Together with billions of other devices worldwide, it forms a distributed system.
Another reason for building distributed systems is that some applications require high availability and need to be resilient to single-node failures. Dropbox replicates your data across multiple nodes so that the loss of a single node doesn’t cause all your data to be lost.
Some applications need to tackle workloads that are just too big to fit on a single node, no matter how powerful. For example, Google receives hundreds of thousands of search requests per second from all over the globe. There is no way a single node could handle that.
And finally, some applications have performance requirements that would be physically impossible to achieve with a single node. Netflix can seamlessly stream movies in high resolution to your TV because it has a datacenter close to you.
This book will guide you through the fundamental challenges that need to be solved to design, build and operate distributed systems: communication, coordination, scalability, resiliency, and operations.
The first challenge comes from the fact that nodes need to communicate over the network with each other. For example, when your browser wants to load a website, it resolves the server’s address from the URL and sends an HTTP request to it. In turn, the server returns a response with the content of the page to the client.
How are request and response messages represented on the wire? What happens when there is a temporary network outage, or some faulty network switch flips a few bits in the messages? How can you guarantee that no intermediary can snoop into the communication?
Although it would be convenient to assume that some networking library is going to abstract all communication concerns away, in practice it’s not that simple because abstractions leak, and you need to understand how the stack works when that happens.
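To make the request/response exchange concrete, here is a minimal sketch of what a browser-like client does at the socket level. The tiny in-process server is a hypothetical stand-in for a real web server; the raw bytes sent over the connection are a valid HTTP/1.1 request and response.

```python
import socket
import threading

def tiny_http_server(sock):
    """A stand-in for a web server: accepts one connection, reads the
    request, and replies with a fixed HTTP response."""
    conn, _ = sock.accept()
    conn.recv(4096)  # read (and here, ignore) the request
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
    conn.close()
    sock.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
t = threading.Thread(target=tiny_http_server, args=(listener,))
t.start()

# The client opens a TCP connection to the server's address and sends
# an HTTP request, just like a browser loading a page.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
response = b""
while chunk := client.recv(4096):
    response += chunk
client.close()
t.join()
print(response.decode().splitlines()[0])  # HTTP/1.1 200 OK
```

Everything below the application layer — connection setup, retransmission, routing — is hidden behind those two sockets, which is exactly the abstraction that occasionally leaks.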
Another hard challenge of building distributed systems is coordinating nodes into a single coherent whole in the presence of failures. A fault is a component that stopped working, and a system is fault-tolerant when it can continue to operate despite one or more faults. The “two generals” problem is a famous thought experiment that showcases why this is a challenging problem.
Suppose there are two generals (nodes), each commanding its own army, that need to agree on a time to jointly attack a city. There is some distance between the armies, and the only way to communicate is by sending a messenger (messages). Unfortunately, these messengers can be captured by the enemy (network failure).
Is there a way for the generals to agree on a time? Well, general 1 could send a message with a proposed time to general 2 and wait for a response. What if no response arrives, though? Was one of the messengers captured? Perhaps a messenger was injured, and it’s taking longer than expected to arrive at the destination? Should the general send another messenger?
You can see that this problem is much harder than it originally appeared. As it turns out, no matter how many messengers are dispatched, neither general can be completely certain that the other army will attack the city at the same time. Although sending more messengers increases the general’s confidence, it never reaches absolute certainty.
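The probabilistic side of the argument can be sketched in a few lines. This only models message loss — the deeper result of the thought experiment is that no protocol, however many acknowledgments it exchanges, lets the generals reach certainty — but it shows why extra messengers raise confidence without ever closing the gap:

```python
def confidence(p_capture: float, n_messengers: int) -> float:
    """Probability that at least one of n messengers got through,
    assuming each is captured independently with probability p_capture."""
    return 1 - p_capture ** n_messengers

# More messengers raise the general's confidence, but it never
# reaches 1: some probability of total loss always remains.
for n in (1, 5, 10, 20):
    print(n, confidence(0.5, n))
```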
Because coordination is such a key topic, the second part of this book is dedicated to distributed algorithms used to implement coordination.
The performance of a distributed system represents how efficiently it handles load, and it’s generally measured with throughput and response time. Throughput is the number of operations processed per second, and response time is the total time between a client request and its response.
Load can be measured in different ways since it’s specific to the system’s use cases. For example, the number of concurrent users, the number of communication links, or the ratio of writes to reads are all different forms of load.
As the load increases, it will eventually reach the system’s capacity — the maximum load the system can withstand. At that point, the system’s performance either plateaus or worsens, as shown in Figure 1.1. If the load on the system continues to grow, it will eventually hit a point where most operations fail or time out.
Figure 1.1: The system throughput on the y axis is the subset of client requests (x axis) that can be handled without errors and with low response times, also referred to as its goodput.
The capacity of a distributed system depends on its architecture and an intricate web of physical limitations like the nodes’ memory size and clock cycle, and the bandwidth and latency of network links.
A quick and easy way to increase the capacity is buying more expensive hardware with better performance, which is referred to as scaling up. But that will hit a brick wall sooner or later. When that option is no longer available, the alternative is scaling out by adding more machines to the system.
In the book’s third part, we will explore the main architectural patterns that you can leverage to scale out applications: functional decomposition, duplication, and partitioning.
A distributed system is resilient when it can continue to do its job even when failures happen. And at scale, any failure that can happen will eventually occur. Every component of a system has a probability of failing — nodes can crash, network links can be severed, etc. No matter how small that probability is, the more components there are, and the more operations the system performs, the higher the absolute number of failures becomes. And it gets worse: since failures typically aren’t independent, the failure of one component can increase the probability that another will fail.
Failures that are left unchecked can impact the system’s availability, which is defined as the amount of time the application can serve requests divided by the duration of the period measured. In other words, it’s the percentage of time the system is capable of servicing requests and doing useful work.
Availability is often described with nines, a shorthand way of expressing percentages of availability. Three nines are typically considered acceptable, and anything above four is considered to be highly available.
| Availability % | Downtime per day |
|---|---|
| 90% (“one nine”) | 2.40 hours |
| 99% (“two nines”) | 14.40 minutes |
| 99.9% (“three nines”) | 1.44 minutes |
| 99.99% (“four nines”) | 8.64 seconds |
| 99.999% (“five nines”) | 864 milliseconds |
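The numbers in the table follow directly from the definition of availability — the unavailable fraction of a day is (1 − availability) times 86,400 seconds:

```python
def downtime_per_day(availability_pct: float) -> float:
    """Seconds per day during which the system may be unavailable."""
    return (1 - availability_pct / 100) * 24 * 60 * 60

# Reproduces the table: each extra nine cuts allowed downtime tenfold.
for pct in (90, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% available -> {downtime_per_day(pct):.2f} s of downtime/day")
```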
If the system isn’t resilient to failures, which only increase as the application scales out to handle more load, its availability will inevitably drop. Because of that, a distributed system needs to embrace failure and work around it using techniques such as redundancy and self-healing mechanisms.
As an engineer, you need to be paranoid and assess the risk that a component can fail by considering the likelihood of it happening and its resulting impact when it does. If the risk is high, you will need to mitigate it. Part 4 of the book is dedicated to fault tolerance and it introduces various resiliency patterns, such as rate limiting and circuit breakers.
Distributed systems need to be tested, deployed, and maintained. It used to be that one team developed an application, and another was responsible for operating it. The rise of microservices and DevOps has changed that. The same team that designs a system is also responsible for its live-site operation. That’s a good thing as there is no better way to find out where a system falls short than experiencing it by being on-call for it.
New deployments need to be rolled out continuously in a safe manner without affecting the system’s availability. The system needs to be observable so that it’s easy to understand what’s happening at any time. Alerts need to fire when its service level objectives are at risk of being breached, and a human needs to be looped in. The book’s final part explores best practices to test and operate distributed systems.
Distributed systems come in all shapes and sizes. The book anchors the discussion to the backend of systems composed of commodity machines that work in unison to implement a business feature. This comprises the majority of large-scale systems being built today.
Before we can start tackling the fundamentals, we need to discuss the different ways a distributed system can be decomposed into parts and relationships, or in other words, its architecture. The architecture differs depending on the angle you look at it.
Physically, a distributed system is an ensemble of physical machines that communicate over network links.
At run-time, a distributed system is composed of software processes that communicate via inter-process communication (IPC) mechanisms like HTTP, and are hosted on machines.
From an implementation perspective, a distributed system is a set of loosely coupled components, called services, that can be deployed and scaled independently.
A service implements one specific part of the overall system’s capabilities. At the core of its implementation is the business logic, which exposes interfaces used to communicate with the outside world. By interface, I mean the kind offered by your language of choice, like Java or C#. An “inbound” interface defines the operations that a service offers to its clients. In contrast, an “outbound” interface defines operations that the service uses to communicate with external services, like data stores, messaging services, and so on.
Remote clients can’t just invoke an interface, which is why adapters are required to hook up IPC mechanisms with the service’s interfaces. An inbound adapter is part of the service’s Application Programming Interface (API); it handles the requests received from an IPC mechanism, like HTTP, by invoking operations defined in the inbound interfaces. In contrast, outbound adapters implement the service’s outbound interfaces, granting the business logic access to external services, like data stores. This is illustrated in Figure 1.2.
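The split between interfaces, business logic, and adapters can be sketched in code. All the names here (a catalog service, a product repository) are hypothetical examples, not from the book; the in-memory dictionary stands in for the SQL store of Figure 1.2:

```python
from abc import ABC, abstractmethod

# "Inbound" interface: the operations the service offers its clients.
class CatalogService(ABC):
    @abstractmethod
    def get_product(self, product_id: str) -> dict: ...

# "Outbound" interface: what the business logic needs from the outside
# world (here, a data store).
class ProductRepository(ABC):
    @abstractmethod
    def find(self, product_id: str) -> dict: ...

# Business logic: depends only on interfaces, never on HTTP or SQL.
class Catalog(CatalogService):
    def __init__(self, repository: ProductRepository):
        self.repository = repository

    def get_product(self, product_id: str) -> dict:
        return self.repository.find(product_id)

# Outbound adapter: implements the outbound interface against a concrete
# technology (an in-memory dict standing in for a SQL store).
class InMemoryRepository(ProductRepository):
    def __init__(self, rows: dict):
        self.rows = rows

    def find(self, product_id: str) -> dict:
        return self.rows[product_id]

# Inbound adapter: translates an IPC request (say, an HTTP GET on
# /products/42) into a call on the inbound interface.
def handle_http_get(service: CatalogService, path: str) -> dict:
    product_id = path.rsplit("/", 1)[-1]
    return service.get_product(product_id)

catalog = Catalog(InMemoryRepository({"42": {"name": "widget"}}))
print(handle_http_get(catalog, "/products/42"))  # {'name': 'widget'}
```

Because the business logic sees only the interfaces, the adapters on either side can be swapped — HTTP for gRPC, or the in-memory store for a real database — without touching it.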
Figure 1.2: The business logic uses the messaging interface implemented by the Kafka producer to send messages and the repository interface to access the SQL store. In contrast, the HTTP controller handles incoming requests using the service interface.
A process running a service is referred to as a server, while a process that sends requests to a server is referred to as a client. Sometimes, a process is both a client and a server, since the two aren’t mutually exclusive.
For simplicity, I will assume that an individual instance of a service runs entirely within the boundaries of a single server process. Similarly, I assume that a process has a single thread. This allows me to neglect some implementation details that only complicate our discussion without adding much value.
In the rest of the book, I will switch between the different architectural points of view (see Figure 1.3), depending on which one is more appropriate to discuss a particular topic. Remember that they are just different ways to look at the same system.
Figure 1.3: The different architectural points of view used in this book.
Communication between processes over the network, or inter-process communication (IPC), is at the heart of distributed systems. Network protocols are arranged in a stack, where each layer builds on the abstraction provided by the layer below, and lower layers are closer to the hardware. When a process sends data to another through the network, it moves through the stack from the top layer to the bottom one and vice-versa on the other end, as shown in Figure 1.4.
Figure 1.4: Internet protocol suite
The link layer consists of network protocols that operate on local network links, like Ethernet or Wi-Fi, and provides an interface to the underlying network hardware. Switches operate at this layer and forward Ethernet packets based on their destination MAC address.
The internet layer uses addresses to route packets from one machine to another across the network. The Internet Protocol (IP) is the core protocol of this layer, which delivers packets on a best-effort basis. Routers operate at this layer and forward IP packets based on their destination IP address.
The transport layer transmits data between two processes using port numbers to address the processes on either end. The most important protocol in this layer is the Transmission Control Protocol (TCP).
The application layer defines high-level communication protocols, like HTTP or DNS. Typically your code will target this level of abstraction.
Even though each protocol builds on top of the one below it, sometimes the abstractions leak. If you don’t know how the lower layers work, you will have a hard time troubleshooting networking issues that will inevitably arise.
Chapter 2 describes how to build a reliable communication channel (TCP) on top of an unreliable one (IP), which can drop, duplicate, and deliver data out of order. Building reliable abstractions on top of unreliable ones is a common pattern that we will encounter many times as we explore further how distributed systems work.
Chapter 3 describes how to build a secure channel (TLS) on top of a reliable one (TCP), which provides encryption, authentication, and integrity.
Chapter 4 dives into how the phone book of the Internet (DNS) works, which allows nodes to discover others using names. At its heart, DNS is a distributed, hierarchical, and eventually consistent key-value store. By studying it, we will get a first taste of eventual consistency.
Chapter 5 concludes this part by discussing how services can expose APIs that other nodes can use to send commands or notifications to them. Specifically, we will dive into the implementation of a RESTful HTTP API.
TCP is a transport-layer protocol that exposes a reliable communication channel between two processes on top of IP. TCP guarantees that a stream of bytes arrives in order, without any gaps, duplication or corruption. TCP also implements a set of stability patterns to avoid overwhelming the network or the receiver.
To create the illusion of a reliable channel, TCP partitions a byte stream into discrete packets called segments. The segments are sequentially numbered, which allows the receiver to detect holes and duplicates. Every segment sent needs to be acknowledged by the receiver. When that doesn’t happen, a timer fires on the sending side, and the segment is retransmitted. To ensure that the data hasn’t been corrupted in transit, the receiver uses a checksum to verify the integrity of a delivered segment.
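A toy receiver makes these mechanisms tangible. This is a deliberately simplified sketch, not real TCP — TCP numbers bytes rather than whole segments and uses a ones’-complement checksum rather than CRC-32 — but the duplicate, hole, and corruption cases map one-to-one:

```python
import zlib

def make_segment(seq: int, payload: bytes) -> dict:
    # Toy segment: sequence number, payload, and a checksum over both.
    return {"seq": seq, "payload": payload,
            "checksum": zlib.crc32(seq.to_bytes(4, "big") + payload)}

class Receiver:
    def __init__(self):
        self.next_seq = 0  # the segment number we expect next
        self.data = b""

    def deliver(self, segment: dict) -> str:
        valid = segment["checksum"] == zlib.crc32(
            segment["seq"].to_bytes(4, "big") + segment["payload"])
        if not valid:
            return "corrupted: drop, the sender will retransmit"
        if segment["seq"] < self.next_seq:
            return "duplicate: drop, but re-acknowledge"
        if segment["seq"] > self.next_seq:
            return "hole: don't acknowledge yet"
        self.data += segment["payload"]
        self.next_seq += 1
        return f"ack {self.next_seq}"

rx = Receiver()
print(rx.deliver(make_segment(0, b"hel")))    # ack 1
print(rx.deliver(make_segment(0, b"hel")))    # duplicate: drop, but re-acknowledge
print(rx.deliver(make_segment(2, b"rld")))    # hole: don't acknowledge yet
print(rx.deliver(make_segment(1, b"lo wo")))  # ack 2
```

Because the out-of-order segment is never acknowledged, the sender’s retransmission timer eventually fires and the missing data is sent again — that is how the gap-free byte stream is reconstructed.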
A connection needs to be opened before any data can be transmitted on a TCP channel. The state of the connection is managed by the operating system on both ends through a socket. The socket keeps track of the state changes of the connection during its lifetime. At a high level, there are three states the connection can be in:
- The opening state, in which the connection is being created.
- The established state, in which the connection is open and data is being transferred.
- The closing state, in which the connection is being closed.
This is a simplification, though, as there are more states than the three above.
A server must be listening for connection requests from clients before a connection is established. TCP uses a three-way handshake to create a new connection, as shown in Figure 2.1:
1. The sender picks a random sequence number x and sends a SYN segment to the receiver.
2. The receiver increments x, chooses its own random sequence number y, and sends back a SYN/ACK segment.
3. The sender increments x and y and completes the handshake by sending an ACK segment.
The sequence numbers are used by TCP to ensure the data is delivered in order and without holes.
Figure 2.1: Three-way handshake
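The three segments of Figure 2.1 can be sketched as plain data. This toy exchange keeps only the sequence/acknowledgment bookkeeping; real TCP carries these numbers in segment headers along with flags, windows, and options:

```python
import random

# 1. The client picks a random sequence number x and sends a SYN segment.
x = random.randrange(2**32)
syn = {"flag": "SYN", "seq": x}

# 2. The server acknowledges x + 1, picks its own random sequence
#    number y, and replies with a SYN/ACK segment.
y = random.randrange(2**32)
syn_ack = {"flag": "SYN/ACK", "seq": y, "ack": syn["seq"] + 1}

# 3. The client acknowledges y + 1 with an ACK segment; one full round
#    trip after the SYN, the connection is established on both ends.
ack = {"flag": "ACK", "seq": syn_ack["ack"], "ack": syn_ack["seq"] + 1}

print(ack["ack"] == y + 1)  # True
```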
The handshake introduces a full round-trip in which no application data is sent. Until the connection has been opened, its bandwidth is essentially zero. The lower the round trip time is, the faster the connection can be established. Putting servers closer to the clients and reusing connections helps reduce this cold-start penalty.
After data transmission is complete, the connection needs to be closed to release all resources on both ends. This termination phase involves multiple round-trips.
Flow control is a backoff mechanism implemented to prevent the sender from overwhelming the receiver. Incoming TCP segments that are waiting to be processed by the application are stored by the receiver in a receive buffer, as shown in Figure 2.2.
Figure 2.2: The receive buffer stores data that hasn’t yet been processed by the application.
The receiver also communicates back to the sender the size of the buffer whenever it acknowledges a segment, as shown in Figure 2.3. The sender, if it’s respecting the protocol, avoids sending more data than can fit in the receiver’s buffer.
Figure 2.3: The size of the receive buffer is communicated in the headers of acknowledgment segments.
This mechanism is not too dissimilar to rate-limiting at the service level. But rather than rate-limiting on an API key or an IP address, TCP rate-limits at the connection level.
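A compliant sender can be sketched as a few lines of bookkeeping. This toy model tracks unacknowledged bytes against the receiver’s last advertised window, which is the essence of flow control:

```python
class Sender:
    def __init__(self, receive_window: int):
        self.receive_window = receive_window  # advertised by the receiver
        self.in_flight = 0                    # sent but not yet acknowledged

    def send(self, nbytes: int) -> bool:
        # Flow control: never exceed what the receiver's buffer can hold.
        if self.in_flight + nbytes > self.receive_window:
            return False  # back off until the window reopens
        self.in_flight += nbytes
        return True

    def on_ack(self, nbytes: int, advertised_window: int) -> None:
        # Acknowledgments free up in-flight bytes and carry the
        # receiver's current buffer size in their headers.
        self.in_flight -= nbytes
        self.receive_window = advertised_window

sender = Sender(receive_window=4)
print(sender.send(3))                  # True
print(sender.send(2))                  # False: the buffer would overflow
sender.on_ack(3, advertised_window=8)  # the receiver drained its buffer
print(sender.send(5))                  # True
```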
TCP not only guards against overwhelming the receiver, but also against flooding the underlying network.
The sender estimates the available bandwidth of the underlying network empirically through measurements. The sender maintains a so-called congestion window, which represents the total number of outstanding segments that can be sent without an acknowledgment from the other side. The size of the receiver window limits the maximum size of the congestion window. The smaller the congestion window is, the fewer bytes can be in-flight at any given time, and the less bandwidth is utilized.
When a new connection is established, the size of the congestion window is set to a system default. Then, for every segment acknowledged, the window increases its size exponentially until reaching an upper limit. This means that we can’t use the network’s full capacity right after a connection is established. The lower the round trip time (RTT) is, the quicker the sender can start utilizing the underlying network’s bandwidth, as shown in Figure 2.4.
Figure 2.4: The lower the RTT is, the quicker the sender can start utilizing the underlying network’s bandwidth.
What happens if a segment is lost? When the sender detects a missed acknowledgment through a timeout, a mechanism called congestion avoidance kicks in, and the congestion window size is reduced. From there onwards, the passing of time increases the window size by a certain amount, and timeouts decrease it by another.
As mentioned earlier, the size of the congestion window defines the maximum number of bytes that can be sent without receiving an acknowledgment. Because the sender needs to wait for a full round trip to get an acknowledgment, we can derive the maximum theoretical bandwidth by dividing the size of the congestion window by the round trip time:

Bandwidth = WinSize / RTT

The equation shows that bandwidth is a function of latency. TCP will try very hard to optimize the window size since it can’t do anything about the round trip time. However, that doesn’t always yield the optimal configuration. Due to the way congestion control works, the lower the round trip time is, the better the underlying network’s bandwidth is utilized. This is one more reason to put servers geographically close to the clients.
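Plugging in some illustrative numbers (a 64 KiB window, which is the classic maximum without window scaling) shows how strongly latency bounds throughput:

```python
def max_bandwidth(window_bytes: int, rtt_seconds: float) -> float:
    # Bandwidth = WinSize / RTT: at most one full window of bytes can
    # be delivered per round trip.
    return window_bytes / rtt_seconds

# The same 64 KiB window moves ten times the data per second when the
# round trip time drops from 100 ms to 10 ms.
print(max_bandwidth(64 * 1024, 0.100))  # bytes per second
print(max_bandwidth(64 * 1024, 0.010))
```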
TCP’s reliability and stability come at the price of lower bandwidth and higher latencies than the underlying network is actually capable of delivering. If you drop the stability and reliability mechanisms that TCP provides, what you get is a simple protocol named User Datagram Protocol (UDP) — a connectionless transport layer protocol that can be used as an alternative to TCP.
Unlike TCP, UDP does not expose the abstraction of a byte stream to its clients. Clients can only send discrete packets of a limited size, called datagrams. UDP doesn’t offer any reliability as datagrams don’t have sequence numbers and are not acknowledged. UDP doesn’t implement flow and congestion control either. Overall, UDP is a lean and barebones protocol. It’s used to bootstrap custom protocols that provide some, but not all, of the stability and reliability guarantees that TCP does.
For example, in modern multi-player games, clients sample gamepad, mouse and keyboard events several times per second and send them to a server that keeps track of the global game state. Similarly, the server samples the game state several times per second and sends these snapshots back to the clients. If a snapshot is lost in transmission, there is no value in retransmitting it as the game evolves in real-time; by the time the retransmitted snapshot would get to the destination, it would be obsolete. This is a use case where UDP shines, as TCP would attempt to redeliver the missing data and consequently slow down the client’s experience.
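The mechanics can be sketched with Python’s standard socket API. The snapshot contents are hypothetical, and the loopback interface stands in for a real network:

```python
import socket

# "Server" and "client" UDP sockets on the loopback interface; no
# handshake and no connection state is involved.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))        # let the OS pick a free port
addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Each sendto() is an independent datagram: if it's lost, nothing
# retransmits it, which is exactly what a game snapshot wants.
client.sendto(b'{"tick": 1042, "x": 3.5, "y": 7.2}', addr)

snapshot, _ = server.recvfrom(4096)  # a datagram arrives whole, or not at all
client.close()
server.close()
```

If a newer snapshot is already on the way, losing this one costs nothing; a TCP stream would instead stall until the lost bytes were redelivered.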
We now know how to reliably send bytes from one process to another over the network. The problem is these bytes are sent in the clear, and any middle-man can intercept our communication. To protect against that, we can use the Transport Layer Security (TLS) protocol. TLS runs on top of TCP and encrypts the communication channel so that application layer protocols, like HTTP, can leverage it to communicate securely. In a nutshell, TLS provides encryption, authentication, and integrity.
Encryption guarantees that the data transmitted between a client and a server is obfuscated and can only be read by the communicating processes.
When the TLS connection is first opened, the client and the server negotiate a shared encryption secret using asymmetric encryption. Both parties generate a key-pair consisting of a private and public part. The processes are then able to create a shared secret by exchanging their public keys. This is possible thanks to some mathematical properties of the key-pairs. The beauty of this approach is that the shared secret is never communicated over the wire.
Although asymmetric encryption is slow and expensive, it’s only used to create the shared encryption key. After that, symmetric encryption is used, which is fast and cheap. The shared key is periodically renegotiated to minimize the amount of data that can be deciphered if the shared key is broken.
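The idea behind such a key agreement can be sketched with a toy Diffie-Hellman exchange. The numbers here are deliberately tiny and utterly insecure; real TLS negotiates the exact key-exchange method during the handshake and uses far larger parameters or elliptic curves:

```python
# Toy Diffie-Hellman key agreement (illustrative only).
p, g = 23, 5                  # public parameters: prime modulus and generator

alice_private = 6             # never leaves Alice's machine
bob_private = 15              # never leaves Bob's machine

alice_public = pow(g, alice_private, p)  # sent over the wire
bob_public = pow(g, bob_private, p)      # sent over the wire

# Each side combines its own private key with the other's public key.
alice_secret = pow(bob_public, alice_private, p)
bob_secret = pow(alice_public, bob_private, p)

assert alice_secret == bob_secret  # same shared secret, never transmitted
```

An eavesdropper sees only `p`, `g`, and the two public values; recovering the private exponents from them is the hard problem the scheme rests on.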
Encrypting in-flight data has a CPU penalty, but it’s negligible since modern processors actually come with cryptographic instructions. Unless you have a very good reason, you should use TLS for all communications, even those that are not going through the public Internet.
Although we have a way to obfuscate data transmitted across the wire, the client still needs to authenticate the server to verify it’s who it claims to be. Similarly, the server might want to authenticate the identity of the client.
TLS implements authentication using digital signatures based on asymmetric cryptography. The server generates a key-pair with a private and a public key, and shares its public key with the client. When the server sends a message to the client, it signs it with its private key. The client uses the public key of the server to verify that the digital signature was actually signed with the private key. This is possible thanks to mathematical properties of the key-pair.
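The sign-and-verify idea can be sketched with textbook RSA. The numbers are tiny and insecure, the digest stands in for a hash of the message, and real implementations add padding schemes on top:

```python
# Toy RSA signature (illustrative only: real keys are thousands of bits).
p, q = 61, 53
n = p * q                      # public modulus: 3233
e = 17                         # public exponent
d = 2753                       # private exponent: inverse of e mod (p-1)*(q-1)

digest = 42                    # stand-in for the hash of the message
signature = pow(digest, d, n)  # signing: only the private key holder can do this

# Anyone holding the public key (n, e) can check the signature.
verified = pow(signature, e, n) == digest
```

Because only the private key can produce a signature that the public key validates, a valid signature proves the message came from the key’s owner.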
The problem with this naive approach is that the client has no idea whether the public key shared by the server is authentic, so we have certificates to prove the ownership of a public key for a specific entity. A certificate includes information about the owning entity, the expiration date, the public key, and a digital signature of the third-party entity that issued the certificate. The certificate’s issuing entity is called a certificate authority (CA), which is in turn represented with a certificate. This creates a chain of certificates that ends with a certificate issued by a root CA, which self-signs its certificate, as shown in Figure 3.1.
For a TLS certificate to be trusted by a device, the certificate, or one of its ancestors, must be present in the trusted store of the client. Trusted root CAs, such as Let’s Encrypt, are typically included in the client’s trusted store by default by the operating system vendor.
Figure 3.1: A certificate chain ends with a self-signed certificate issued by a root CA.
When a TLS connection is opened, the server sends the full certificate chain to the client, starting with the server’s certificate and ending with the root CA. The client verifies the server’s certificate by scanning the certificate chain until a certificate is found that it trusts. Then the certificates are verified in the reverse order from that point in the chain. The verification checks several things, like the certificate’s expiration date and whether the digital signature was actually signed by the issuing CA. If the verification reaches the last certificate in the path without errors, the path is verified, and the server is authenticated.
One of the most common mistakes when using TLS is letting a certificate expire. When that happens, the client won’t be able to verify the server’s identity, and opening a connection to the remote process will fail. This can bring an entire service down as clients are no longer able to connect with it. Automation to monitor and auto-renew certificates close to expiration is well worth the investment.
Even if the data is obfuscated, a middle man could still tamper with it; for example, random bits within the messages could be swapped. To protect against tampering, TLS verifies the integrity of the data by calculating a message digest. A secure hash function is used to create a message authentication code (HMAC). When a process receives a message, it recomputes the digest of the message and checks whether it matches the digest included in the message. If not, then the message has either been corrupted during transmission or has been tampered with. In this case, the message is dropped.
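With Python’s standard `hmac` module, the scheme looks roughly like this. The key and message are made up; in TLS, the MAC key comes out of the handshake:

```python
import hashlib
import hmac

key = b"shared-secret-key"     # in TLS, derived during the handshake
message = b"GET /products HTTP/1.1"

# The sender attaches the HMAC of the message...
tag = hmac.new(key, message, hashlib.sha256).digest()

# ...and the receiver recomputes it and compares the two in constant time.
authentic = hmac.compare_digest(tag, hmac.new(key, message, hashlib.sha256).digest())

# Flipping even a single bit in transit makes verification fail.
tampered = bytes([message[0] ^ 1]) + message[1:]
tampered_tag = hmac.new(key, tampered, hashlib.sha256).digest()
detected = not hmac.compare_digest(tag, tampered_tag)
```

The constant-time comparison matters: a naive `==` can leak, via timing, how many leading bytes of a forged tag were correct.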
The TLS HMAC protects against data corruption as well, not just tampering. You might be wondering how data can be corrupted if TCP is supposed to guarantee its integrity. While TCP does use a checksum to protect against data corruption, it’s not 100% reliable because it fails to detect errors for roughly 1 in 16 million to 10 billion packets. With packets of 1KB, this can happen every 16 GB to 10 TB transmitted.
When a new TLS connection is established, a handshake between the client and server occurs during which:

1. the parties agree on the cipher suite, i.e., the specific set of algorithms to use for key exchange, encryption, and integrity;
2. the parties use a key-exchange algorithm to create a shared secret, which the symmetric encryption that follows will use;
3. the client verifies the server’s certificate (and, optionally, the server verifies the client’s).
These operations don’t necessarily happen in this order, as modern implementations use several optimizations to reduce round trips. The handshake typically requires 2 round trips with TLS 1.2 and just one with TLS 1.3. The bottom line is that creating a new connection is expensive; yet another reason to put your servers geographically close to the clients and to reuse connections when possible.
So far, we explored how to create a reliable and secure channel between two processes located on different machines. However, to create a new connection with a remote process, we still need to discover its IP address. To resolve hostnames into IP addresses, we can use the phone book of the Internet: the Domain Name System (DNS) — a distributed, hierarchical, and eventually consistent key-value store.
In this chapter, we will look at how DNS resolution works in a browser, but the process is the same for any other client. When you enter a URL in your browser, the first step is to resolve the hostname’s IP address, which is then used to open a new TLS connection.
Concretely, let’s take a look at how the DNS resolution works when you type www.example.com in your browser (see Figure 4.1).
The browser checks whether it has resolved the hostname before in its local cache. If so, it returns the cached IP address; otherwise it routes the request to a DNS resolver. The DNS resolver is typically a DNS server hosted by your Internet Service Provider.
The resolver is responsible for iteratively translating the hostname for the client. The reason why it’s iterative will become evident in a moment. The resolver first checks its local cache for a cached entry, and if one is found, it’s returned to the client. If not, the query is sent to a root name server (root NS).
The root name server maps the top-level domain (TLD) of an incoming request, like .com, to the address of the name server responsible for it.
The resolver, armed with the address of the TLD, sends the resolution request to the TLD name server for the domain, in our case .com.
The TLD name server maps the domain name of a request to the address of the authoritative name server responsible for it. An authoritative name server is responsible for a specific domain and holds all records that map the hostnames to IP addresses within that domain.
The resolver finally queries the authoritative name server for www.example.com, which checks its entries for the www hostname and returns the IP address associated with it back to the resolver.
If the query included a subdomain of example.com, e.g., news.example.com, the authoritative name server would have returned the address of the name server responsible for the subdomain.
Figure 4.1: DNS resolution process
The resolution process involves several round trips in the worst case, but its beauty is that the address of a root name server is all that’s needed to resolve any hostname. Given the costs involved in resolving a hostname, it comes as no surprise that the designers of DNS thought of ways to reduce them.
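The iterative walk can be sketched by modeling each name server as a plain lookup table. The server names, records, and address are made up, and a real resolver also handles referrals, timeouts, and TTLs:

```python
# Each "name server" is a lookup table; the resolver walks the hierarchy
# from the root down (all names and addresses are illustrative).
ROOT_NS = {"com": "a.gtld-servers.net"}                       # TLD -> TLD name server
TLD_NS = {"a.gtld-servers.net": {"example.com": "ns.example.com"}}
AUTHORITATIVE_NS = {"ns.example.com": {"www.example.com": "93.184.216.34"}}

def resolve(hostname):
    tld = hostname.rsplit(".", 1)[-1]             # "com"
    domain = ".".join(hostname.split(".")[-2:])   # "example.com"
    tld_server = ROOT_NS[tld]                     # round trip 1: ask the root NS
    auth_server = TLD_NS[tld_server][domain]      # round trip 2: ask the TLD NS
    return AUTHORITATIVE_NS[auth_server][hostname]  # round trip 3: ask the authoritative NS

ip = resolve("www.example.com")
```

Three lookups for one hostname is exactly the cost that the caches discussed next try to avoid.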
DNS uses UDP to serve DNS queries as it’s lean and has low overhead. UDP was a great choice at the time, as no price needs to be paid to open a new connection. That said, it’s not secure, since requests are sent in the clear over the Internet, allowing third parties to snoop. Hence, the industry is slowly pushing towards running DNS on top of TLS.
The resolution would be slow if every request had to go through several name server lookups. Not only that, but think of the scale requirements on the name servers to handle the global resolution load. Caching is used to speed up the resolution process, as the mapping of domain names to IP addresses doesn’t change often — the browser, operating system, and DNS resolver all use caches internally.
How do these caches know when to expire a record? Every DNS record has a time to live (TTL) that informs the cache how long the entry is valid. But, there is no guarantee that the client plays nicely and enforces the TTL. Don’t be surprised when you change a DNS entry and find out that a small fraction of clients are still trying to connect to the old address days after the change.
Setting a TTL requires making a tradeoff. If you use a long TTL, many clients won’t see a change for a long time. But if you set it too short, you increase the load on the name servers and the average response time of requests because the clients will have to resolve the entry more often.
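A sketch of the caching logic a well-behaved client might use; timestamps are passed in explicitly to keep the example deterministic:

```python
import time

class DnsCache:
    """A minimal DNS cache that honors each record's TTL."""

    def __init__(self):
        self._entries = {}

    def put(self, hostname, ip, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._entries[hostname] = (ip, now + ttl)

    def get(self, hostname, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        ip, expires_at = entry
        if now >= expires_at:          # record expired: force a fresh resolution
            del self._entries[hostname]
            return None
        return ip

cache = DnsCache()
cache.put("www.example.com", "93.184.216.34", ttl=300, now=0)
fresh = cache.get("www.example.com", now=10)    # within TTL: served from cache
stale = cache.get("www.example.com", now=301)   # past TTL: entry evicted
```

The `ttl` value is where the tradeoff from the text lives: a larger number means fewer lookups but slower propagation of changes.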
The smaller a record’s TTL, the higher the number of clients impacted if your name server becomes unavailable for any reason. DNS can easily become a single point of failure: if your DNS name server is down and clients can’t find the IP address of your service, they won’t have a way to connect to it. This can lead to massive outages.
A service exposes operations to its consumers via a set of interfaces implemented by its business logic. As remote clients can’t access these directly, adapters — which make up the service’s application programming interface (API) — translate messages received from IPC mechanisms to interface calls, as shown in Figure 5.1.
Figure 5.1: Adapters translate messages received from IPC mechanisms to interface calls.
The communication style between a client and a service can be direct or indirect, depending on whether the client communicates directly with the service or indirectly with it through a broker. Direct communication requires that both processes are up and running for the communication to succeed. However, sometimes this guarantee is either not needed or very hard to achieve, in which case indirect communication can be used.
In this chapter, we will focus our attention on a direct communication style called request-response, in which a client sends a request message to the service, and the service replies back with a response message. This is similar to a function call, but across process boundaries and over the network.
The request and response messages contain data that is serialized in a language-agnostic format. The format impacts a message’s serialization and deserialization speed, whether it’s human-readable, and how hard it is to evolve it over time. A textual format like JSON is self-describing and human-readable, at the expense of increased verbosity and parsing overhead. On the other hand, a binary format like Protocol Buffers is leaner and more performant than a textual one at the expense of human readability.
When a client sends a request to a service, it can block and wait for the response to arrive, making the communication synchronous. Alternatively, it can ask the outbound adapter to invoke a callback when it receives the response, making the communication asynchronous.
Synchronous communication is inefficient, as it blocks threads that could be used to do something else. Some languages, like JavaScript and C#, can completely hide callbacks through language primitives such as async/await. These primitives make writing asynchronous code as straightforward as writing a synchronous one.
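For example, with Python’s async/await the asynchronous version reads much like synchronous code; `fetch` here is a stand-in that simulates a network call:

```python
import asyncio

async def fetch(url):
    # Stand-in for a network request: while the coroutine "waits",
    # the event loop's thread is free to run other work.
    await asyncio.sleep(0.01)
    return f"response for {url}"

async def main():
    # Issue both requests concurrently and await both responses.
    return await asyncio.gather(fetch("/products/42"), fetch("/products/43"))

responses = asyncio.run(main())
```

No callbacks appear in the code, yet the two requests are in flight at the same time and the thread is never blocked waiting on either.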
The most commonly used IPC technologies for request-response interactions are gRPC, REST, and GraphQL. Typically, internal APIs used for service-to-service communications within an organization are implemented with a high-performance RPC framework like gRPC. In contrast, external APIs available to the public tend to be based on REST. In the rest of the chapter, we will walk through the process of creating a RESTful HTTP API.
HTTP is a request-response protocol used to encode and transport information between a client and a server. In an HTTP transaction, the client sends a request message to the server’s API endpoint, and the server replies back with a response message, as shown in Figure 5.2.
In HTTP 1.1, a message is a textual block of data that contains a start line, a set of headers, and an optional body:
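For example, a transaction fetching the product list could look like this (the values are hypothetical):

```http
GET /products?sort=price HTTP/1.1
Host: www.example.com
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 48

[{"id": 42, "category": "Laptop", "price": 999}]
```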
Figure 5.2: An example HTTP transaction between a browser and a web server.
HTTP is a stateless protocol, which means that everything needed by a server to process a request needs to be specified within the request itself, without context from previous requests. HTTP uses TCP for the reliability guarantees discussed in chapter 2. When it rides on top of TLS, it’s also referred to as HTTPS. Needless to say, you should use HTTPS by default.
HTTP 1.1 keeps a connection to a server open by default to avoid creating a new one when the next transaction occurs. Unfortunately, a new request can’t be issued until the response of the previous one has been received; in other words, the transactions have to be serialized. For example, a browser that needs to fetch several images to render an HTML page has to download them one at a time, which can be very inefficient.
Although HTTP 1.1 technically allows some types of requests to be pipelined, pipelining has never been widely adopted due to its limitations. With HTTP 1.1, the typical way to improve the throughput of outgoing requests is to create multiple connections. That comes at a price, though, as connections consume resources like memory and sockets.
HTTP 2 was designed from the ground up to address the main limitations of HTTP 1.1. It uses a binary protocol rather than a textual one, which allows HTTP 2 to multiplex multiple concurrent request-response transactions on the same connection. In early 2020 about half of the most-visited websites on the Internet were using the new HTTP 2 standard. HTTP 3 is the latest iteration of the HTTP standard, which is slowly being rolled out to browsers as I write this — it’s based on UDP and implements its own transport protocol to address some of TCP’s shortcomings.
Given that neither HTTP 2 nor HTTP 3 are ubiquitous yet, you still need to be familiar with HTTP 1.1, which is the standard the book uses going forward as its plain text format is easier to depict.
Suppose we are responsible for implementing a service to manage the product catalog of an e-commerce application. The service must allow users to browse the catalog and admins to create, update, or delete products. Sounds simple enough; the interface of the service could be defined like this:
interface CatalogService
{
    List<Product> GetProducts(...);
    Product GetProduct(...);
    void AddProduct(...);
    void DeleteProduct(...);
    void UpdateProduct(...);
}
External clients can’t invoke interface methods directly, which is where the HTTP adapter comes in. It handles an HTTP request by invoking the methods defined in the service interface and converts their return values into HTTP responses. But to perform this mapping, we first need to understand how to model the API with HTTP in the first place.
An HTTP server hosts resources. A resource is an abstraction of information, like a document, an image, or a collection of other resources. It’s identified by a URL, which describes the location of the resource on the server.
In our catalog service, the collection of products is a type of resource, which could be accessed with a URL like https://www.example.com/products?sort=price, where:

- https is the scheme, which tells the client the protocol to use to connect to the server;
- www.example.com is the hostname of the server hosting the resource;
- /products is the path of the resource on the server;
- ?sort=price is the query string with additional parameters that affect how the request is handled.
The URL without the query string is also referred to as the API’s /products endpoint.
HTTP gives us a lot of flexibility on how to design our API. Nothing forbids us from creating a resource name that looks like a remote procedure, like /getProducts, which expects the additional parameters to be specified in the request’s body, rather than in the query string. But if we were to do this, we could no longer cache the list of products by its URL. This is where REST comes in — it’s a set of conventions and constraints for designing elegant and scalable HTTP APIs. In the rest of this chapter, we will use REST principles where it makes sense.
How should we model relationships? For example, a specific product is a resource that belongs to the collection of products, and that should ideally be reflected in its URL. Hence, the product with the unique identifier 42 could be identified with the relative URL /products/42. The product could also have a list of reviews associated with it, which we can model by appending the nested resource name, reviews, after the parent one, /products/42, e.g., /products/42/reviews. If we were to continue to add more nested resources, the API would become complex. As a rule of thumb, URLs should be kept simple, even if it means that the client might have to perform multiple requests to get the information it needs.
Now that we know how to refer to resources, let’s see how to represent them on the wire when they are transmitted in the body of request and response messages. A resource can be represented in different ways; for example, a product can be represented either with an XML or a JSON document. JSON is typically used to represent non-binary resources in REST APIs:
{
  "id": 42,
  "category": "Laptop",
  "price": 999
}
When a client sends a request to a server to get a resource, it adds several headers to the message to describe its preferred representation. The server uses these headers to pick the most appropriate representation for the resource and decorates the response message with headers that describe it.
HTTP requests can create, read, update, and delete (CRUD) resources by using request methods. When a client makes a request to a server for a particular resource, it specifies which method to use. You can think of a request method as the verb or action to use on a resource.
The most commonly used methods are POST, GET, PUT, and DELETE. For example, the API of our catalog service could be defined as follows (a sketch of one possible mapping):

| Method and path | Operation |
|---|---|
| GET /products | List products |
| GET /products/{id} | Get a specific product |
| POST /products | Create a new product |
| PUT /products/{id} | Update a product |
| DELETE /products/{id} | Delete a product |
Request methods can be classified depending on whether they are safe and idempotent. A safe method should not have any visible side effects and can be safely cached. An idempotent method can be executed multiple times, and the end result should be the same as if it was executed just a single time.
| Method | Safe | Idempotent |
|---|---|---|
| GET | Yes | Yes |
| PUT | No | Yes |
| POST | No | No |
| DELETE | No | Yes |
The concept of idempotency is crucial and will come up repeatedly in the rest of the book, not just in the context of HTTP requests. An idempotent request makes it possible to safely retry requests that have succeeded, but for which the client never received a response; for example, because it crashed and restarted before receiving it.
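The difference can be sketched in Python, with an in-memory dict standing in for the product store: a PUT-style full replacement is repeatable, while a POST-style create is not:

```python
products = {42: {"category": "Laptop", "price": 999}}

def put_product(product_id, representation):
    # PUT carries the full representation: applying it twice leaves the
    # resource in the same state, so retries are safe.
    products[product_id] = representation

def post_product(representation, _next_id=[100]):
    # POST creates a new resource on every call: a blind retry after a
    # lost response creates a duplicate.
    _next_id[0] += 1
    products[_next_id[0]] = representation
    return _next_id[0]

put_product(42, {"category": "Laptop", "price": 899})
put_product(42, {"category": "Laptop", "price": 899})     # retried: same end state
state_after_retries = products[42]

first = post_product({"category": "Mouse", "price": 25})
second = post_product({"category": "Mouse", "price": 25})  # retried: new resource!
```

This is why a client that times out waiting for a PUT response can simply send it again, while retrying a POST requires extra machinery (such as a client-supplied idempotency key).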
After the service has received a request, it needs to send a response back to the client. The HTTP response contains a status code to communicate to the client whether the request succeeded or not. Different status code ranges have different meanings.
Status codes between 200 and 299 are used to communicate success. For example, 200 (OK) means that the request succeeded, and the body of the response contains the requested resource.
Status codes between 300 and 399 are used for redirection. For example, 301 (Moved Permanently) means that the requested resource has been moved to a different URL, specified in the response message Location header.
Status codes between 400 and 499 are reserved for client errors. A request that fails with a client error will usually continue to return the same error if it’s retried, as the error is caused by an issue with the client, not the server. Because of that, it shouldn’t be retried. These client errors are common:

- 400 (Bad Request): the request is malformed, e.g., a required parameter is missing or invalid;
- 401 (Unauthorized): the client needs to authenticate before accessing the resource;
- 403 (Forbidden): the client is authenticated, but it’s not allowed to access the resource;
- 404 (Not Found): the requested resource doesn’t exist.
Status codes between 500 and 599 are reserved for server errors. A request that fails with a server error can be retried as the issue that caused it to fail might be fixed by the time the retry is processed by the server. These are some typical server status codes:

- 500 (Internal Server Error): the server encountered an unexpected condition that prevented it from handling the request;
- 502 (Bad Gateway): the server, while acting as a gateway or proxy, received an invalid response from a downstream server;
- 503 (Service Unavailable): the server is temporarily unable to handle the request, e.g., because it’s overloaded or down for maintenance.
Now that we have learned how to map the operations defined by our service’s interface onto RESTful HTTP endpoints, we can formally define the API with an interface definition language (IDL), a language-independent description of it. The IDL definition can be used to generate boilerplate code for the IPC adapter and client SDKs in your languages of choice.
The OpenAPI specification, which evolved from the Swagger project, is one of the most popular IDLs for RESTful APIs based on HTTP. With it, we can formally describe our API in a YAML document, including the available endpoints, the supported request methods and response status codes for each endpoint, and the schema of the resources’ JSON representation.
For example, this is how part of the /products endpoint of the catalog service’s API could be defined:
openapi: 3.0.0
info:
  version: "1.0.0"
  title: Catalog Service API
paths:
  /products:
    get:
      summary: List products
      parameters:
        - in: query
          name: sort
          required: false
          schema:
            type: string
      responses:
        '200':
          description: list of products in catalog
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/ProductItem'
        '400':
          description: bad input
components:
  schemas:
    ProductItem:
      type: object
      required:
        - id
        - name
        - category
      properties:
        id:
          type: number
        name:
          type: string
        category:
          type: string
Although this is a very simple example and we won’t spend time describing OpenAPI further as it’s mostly an implementation detail, it should give you an idea of its expressiveness. With this definition, we can then run a tool to generate the API’s documentation, boilerplate adapters, and client SDKs for our languages of choice.
APIs start out as beautifully-designed interfaces. Slowly, but surely, they will need to change over time to adapt to new use cases. The last thing you want to do when evolving your API is to introduce a breaking change that requires modifying all the clients in unison, some of which you might have no control over in the first place.
There are two types of changes that can break compatibility, one at the endpoint level and another at the message level. For example, if you were to change the /products endpoint to /fancy-products, it would obviously break clients that haven’t been updated to support the new endpoint. The same goes when making a previously optional query parameter mandatory.
Changing the schema of request and response messages in a backward incompatible way can also wreak havoc. For example, changing the type of the category property in the Product schema from string to number is a breaking change as the old deserialization logic would blow up in clients. Similar arguments can be made for messages represented with other serialization formats, like Protocol Buffers.
To support breaking changes, REST APIs should be versioned by prefixing a version number in the URLs (e.g., /v1/products/), by using a custom header (e.g., Accept-Version: v1), or by using the Accept header with content negotiation (e.g., Accept: application/vnd.example.v1+json).
As a general rule of thumb, you should try to evolve your API in a backwards-compatible way unless you have a very good reason, in which case you need to be prepared to deal with the consequences. Backwards-compatible APIs tend to be not particularly elegant, but they are a necessary evil. There are tools that can compare the IDL specifications of your API and check for breaking changes; use them in your continuous integration pipelines.
So far, we have learned how we can get two processes to communicate reliably and securely with each other. We didn’t go into all this trouble just for the sake of it, though. The end goal has always been to use multiple processes, and services, to build a distributed application that gives its clients the illusion they interact with a single node.
Although achieving a perfect illusion is not always possible or desirable, it's clear that some degree of coordination is needed to build a distributed application. In this part, we will explore the core distributed algorithms at the heart of large-scale services.
Chapter 6 introduces formal models that encode our assumptions about the behavior of nodes, communication links, and timing; think of them as abstractions that allow us to reason about distributed systems by ignoring the complexity of the actual technologies used to implement them.
Chapter 7 describes how to detect that a remote process is unreachable. Since the network is unreliable and processes can crash at any time, a process trying to communicate with another could hang forever without failure detection.
Chapter 8 dives into the concept of time and order. In this chapter, we will first learn why agreeing on the time an event happened in a distributed system is much harder than it looks, and then propose a solution based on clocks that don’t measure the passing of time.
Chapter 9 describes how a group of processes can elect a leader who can perform operations that others can’t, like accessing a shared resource or coordinating other processes’ actions.
Chapter 10 introduces one of the fundamental challenges in distributed systems, namely keeping replicated data in sync across multiple nodes. This chapter explores why there is a tradeoff between consistency and availability and describes how the Raft replication algorithm works.
Chapter 11 dives into how to implement transactions that span data partitioned among multiple nodes or services. Transactions relieve you from a whole range of possible failure scenarios so that you can focus on the actual application logic rather than all possible things that can go wrong.
To reason about distributed systems, we need to define precisely what can and can’t happen. A system model encodes assumptions about the behavior of nodes, communication links, and timing; think of it as a set of assumptions that allow us to reason about distributed systems by ignoring the complexity of the actual technologies used to implement them.
Let’s start by introducing some models for communication links:
Even though these models are just abstractions of real communication links, they are useful to verify the correctness of algorithms. As we have seen in the previous chapters, it’s possible to build a reliable and authenticated communication link on top of a fair-loss one. For example, TCP does precisely that (and more), while TLS implements authentication (and more).
We can also model the different types of node failures we expect to happen:
While it’s possible to take an unreliable communication link and convert it into a more reliable one using a protocol (e.g., keep retransmitting lost messages), the equivalent isn’t possible for nodes. Because of that, algorithms for different node models look very different from each other.
Byzantine node models are typically used to model safety-critical systems like airplane engine systems, nuclear power plants, financial systems, and other systems where a single entity doesn’t fully control all the nodes1. These use cases are outside of the book’s scope, and the algorithms presented will generally assume a crash-recovery model.
Finally, we can also model the timing assumptions:
In the rest of the book, we will generally assume a system model with fair-loss links, nodes with crash-recovery behavior, and partial synchrony. For the interested reader, “Introduction to Reliable and Secure Distributed Programming” is an excellent theoretical book that explores distributed algorithms for a variety of other system models not considered in this text.
But remember, models are just an abstraction of reality, and sometimes abstractions leak. As you read along, question the models’ assumptions and try to imagine how algorithms that rely on them could break.
Several things can go wrong when a client sends a request to a server. In the happy path, the client sends a request and receives a response back. But, what if no response comes back after some time? In that case, it’s impossible to tell whether the server is just very slow, it crashed, or a message couldn’t be delivered because of a network issue (see Figure 7.1).
Figure 7.1: P1 can’t tell whether P2 is slow, crashed or a message was delayed/dropped because of a network issue.
In the worst case, the client will wait forever for a response that will never arrive. The best it can do is make an educated guess on whether the server is likely to be down or unreachable after some time has passed. To do that, the client can configure a timeout to trigger if it hasn’t received a response from the server after a certain amount of time. If and when the timeout triggers, the client considers the server unavailable and throws an error.
The tricky part is defining how long this timeout should be. If it's too short and the server is reachable, the client will wrongly consider the server dead; if it's too long and the server is not reachable, the client will block waiting for a response. The bottom line is that it's not possible to build a perfect failure detector.
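A minimal sketch of this educated guess, using a TCP connection attempt with a deadline; the function name and default timeout are made up for the example:

```python
import socket

def is_server_available(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Try to open a TCP connection within timeout_s seconds.

    A False result is only an educated guess: the server might be slow,
    crashed, or unreachable because of a network issue -- the client
    has no way to tell which."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts, refusals, and routing errors
        return False
```

Note that picking the timeout embodies exactly the tradeoff above: a small value detects failures quickly but misclassifies slow servers as dead.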
A process doesn’t necessarily need to wait to send a message to find out that the destination is not reachable. It can also actively try to maintain a list of processes that are available using pings or heartbeats.
A ping is a periodic request that a process sends to another to check whether it’s still available. The process expects a response to the ping within a specific time frame. If that doesn’t happen, a timeout is triggered that marks the destination as dead. However, the process will keep regularly sending pings to it so that if and when it comes back online, it will reply to a ping and be marked as available again.
A heartbeat is a message that a process periodically sends to another to inform it that it's still up and running. If the destination doesn't receive a heartbeat within a specific time frame, it triggers a timeout and marks the process that missed the heartbeat as dead. If that process later comes back to life and starts sending out heartbeats, it will eventually be marked as available again.
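The bookkeeping on the receiving side can be sketched as follows; the class name and the injectable clock are assumptions made for the example, and real detectors would also account for network jitter:

```python
import time

class HeartbeatDetector:
    """Marks a process as dead if no heartbeat arrived within the timeout."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the detector can be tested
        self.last_heartbeat: dict[str, float] = {}

    def record_heartbeat(self, process_id: str) -> None:
        self.last_heartbeat[process_id] = self.clock()

    def is_alive(self, process_id: str) -> bool:
        last = self.last_heartbeat.get(process_id)
        if last is None:
            return False  # never heard from this process
        return self.clock() - last <= self.timeout_s
```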
Pings and heartbeats are typically used when specific processes frequently interact with each other, and an action needs to be taken as soon as one of them is no longer reachable. If that’s not the case, detecting failures just at communication time is good enough.
Time is an essential concept in any application, even more so in distributed ones. We have already encountered some use for it when discussing the network stack (e.g., DNS record TTL) and failure detection. Time also plays an important role in reconstructing the order of operations by logging their timestamps.
The flow of execution of a single-threaded application is easy to grasp since every operation executes sequentially in time, one after the other. But in a distributed system, there is no shared global clock that all processes agree on and can be used to order their operations. And to make matters worse, processes can run concurrently.
It’s challenging to build distributed applications that work as intended without knowing whether one operation happened before another. Can you imagine designing a TCP-like protocol without using sequence numbers to order the packets? In this chapter, we will learn about a family of clocks that can be used to work out the order of operations across processes in a distributed system.
A process has access to a physical wall-time clock. The most common type is based on a vibrating quartz crystal, which is cheap but not very accurate. The device you are using to read this book is likely using such a clock. It can run slightly faster or slower than others, depending on manufacturing differences and the external temperature. The rate at which a clock runs is also called clock drift. In contrast, the difference between two clocks at a specific point in time is referred to as clock skew.
Because quartz clocks drift, they need to be synced periodically with machines that have access to higher-accuracy clocks, like atomic ones. Atomic clocks measure time based on quantum-mechanical properties of atoms; they are significantly more expensive than quartz clocks and are accurate to within 1 second in 3 million years.
The Network Time Protocol (NTP) is used to synchronize clocks. The challenge is to do so despite the unpredictable latencies introduced by the network. An NTP client estimates the clock skew by correcting the timestamp received from an NTP server with the estimated network latency. Armed with an estimate of the clock skew, the client can adjust its clock, causing it to jump forward or backward in time.
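The core of the skew estimate can be illustrated with the classic four-timestamp calculation; this is a simplified sketch that assumes the network latency is roughly symmetric in both directions, which in practice is only approximately true:

```python
def estimate_clock_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """NTP-style offset estimate between a client and a server.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server reply time (server clock)
    t3: client receive time (client clock)

    (t1 - t0) overestimates the offset by the outbound latency and
    (t2 - t3) underestimates it by the return latency; averaging the
    two cancels the latency if it's symmetric."""
    return ((t1 - t0) + (t2 - t3)) / 2
```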
This creates a problem as measuring the elapsed time between two points in time becomes error-prone. For example, an operation that is executed after another could appear to have been executed before.
Luckily, most operating systems offer a different type of clock that is not affected by time jumps: the monotonic clock. A monotonic clock measures the number of seconds elapsed since an arbitrary point, like when the node started up, and can only move forward in time. A monotonic clock is useful to measure how much time elapsed between two timestamps on the same node, but timestamps of different nodes can’t be compared with each other.
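In Python, for example, the standard library exposes both kinds of clocks; a small sketch of measuring elapsed time safely:

```python
import time

def measure_elapsed(fn) -> float:
    """Measure how long fn() takes using the monotonic clock.

    time.monotonic() can only move forward, so the result is never
    negative; time.time() offers no such guarantee, because NTP can
    adjust the wall clock mid-measurement."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start
```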
Since we don’t have a way to synchronize wall-time clocks across processes perfectly, we can’t depend on them for ordering operations. To solve this problem, we need to look at it from another angle. We know that two operations can’t run concurrently in a single-threaded process as one must happen before the other. This happened-before relationship creates a causal bond between the two operations, as the one that happens first can change the state of the process and affect the operation that comes after it. We can use this intuition to build a different type of clock, one that isn’t tied to the physical concept of time, but captures the causal relationship between operations: a logical clock.
A logical clock measures the passing of time in terms of logical operations, not wall-clock time. The simplest possible logical clock is a counter, which is incremented before an operation is executed. Doing so ensures that each operation has a distinct logical timestamp. If two operations execute on the same process, then necessarily one must come before the other, and their logical timestamps will reflect that. But what about operations executed on different processes?
Imagine sending an email to a friend. Any actions you did before sending that email, like drinking coffee, must have happened before the actions your friend took after receiving the email. Similarly, when one process sends a message to another, a so-called synchronization point is created. The operations executed by the sender before the message was sent must have happened before the operations that the receiver executed after receiving it.
A Lamport clock is a logical clock based on this idea. Every process in the system has its own local logical clock, implemented with a numerical counter that follows specific rules: the counter is incremented before the process executes an operation; when the process sends a message, it attaches a copy of the counter to it; and when the process receives a message, it sets its counter to the maximum of the local value and the received one, and then increments it.
Figure 8.1: Three processes using Lamport clocks. For example, because D happened before F, D’s logical timestamp is less than F’s.
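A minimal sketch of a Lamport clock in code; the class and method names are made up for illustration:

```python
class LamportClock:
    """One numerical counter per process, updated per Lamport's rules."""

    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        # Rule: increment the counter before executing an operation.
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # Rule: increment, then attach the counter to the outgoing message.
        self.counter += 1
        return self.counter

    def receive(self, message_timestamp: int) -> int:
        # Rule: merge with the received timestamp, then increment.
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter
```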
The Lamport clock assumes a crash-stop model, but a crash-recovery one can be supported by persisting the clock’s state on disk, for example.
The rules guarantee that if operation O1 happened-before operation O2, the logical timestamp of O1 is less than that of O2. In the example shown in Figure 8.1, operation D happened-before F, and their logical timestamps, 4 and 5, reflect that.
You would think that the converse also applies: if the logical timestamp of operation O1 is less than that of O2, then O1 happened-before O2. But that can't be guaranteed with Lamport timestamps. Going back to the example in Figure 8.1, operation E didn't happen-before C, even though their timestamps seem to imply it. To guarantee the converse relationship, we have to use a different type of logical clock: the vector clock.
A vector clock is a logical clock that guarantees that if two operations can be ordered by their logical timestamps, then one must have happened-before the other. A vector clock is implemented with an array of counters, one for each process in the system. And similarly to how Lamport clocks are used, each process has its own local copy of the clock.
For example, if the system is composed of 3 processes P1, P2, and P3, each process has a local vector clock implemented with an array1 of 3 counters. The first counter in the array is associated with P1, the second with P2, and the third with P3.
A process updates its local vector clock based on the following rules: the process increments its own counter in the clock before executing an operation; when it sends a message, it attaches a copy of the clock to it; and when it receives a message, it merges the received clock with its local one by taking the element-wise maximum of the counters, and then increments its own counter.
Figure 8.2: Each process has a vector clock represented with an array of three counters.
The beauty of vector clock timestamps is that they can be partially ordered; given two operations O1 and O2 with timestamps T1 and T2, if every counter in T1 is less than or equal to the corresponding counter in T2, and at least one counter in T1 is strictly less than the corresponding counter in T2, then O1 happened-before O2. For example, in Figure 8.2, B happened-before C.
If O1 didn't happen-before O2 and O2 didn't happen-before O1, then the timestamps can't be ordered, and the operations are considered to be concurrent. For example, operations E and C in Figure 8.2 can't be ordered, and therefore they are considered to be concurrent.
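These rules and the partial order can be sketched as follows; per the footnote, a dictionary keyed by process id stands in for the array of counters, and the names are made up for the example:

```python
class VectorClock:
    """One counter per process, stored in a dictionary keyed by process id."""

    def __init__(self, process_id: str):
        self.process_id = process_id
        self.counters: dict[str, int] = {process_id: 0}

    def local_event(self) -> dict:
        # Increment our own counter before executing an operation.
        self.counters[self.process_id] += 1
        return dict(self.counters)

    def receive(self, other: dict) -> dict:
        # Merge: element-wise maximum, then increment our own counter.
        for pid, count in other.items():
            self.counters[pid] = max(self.counters.get(pid, 0), count)
        self.counters[self.process_id] += 1
        return dict(self.counters)

def happened_before(t1: dict, t2: dict) -> bool:
    """True if every counter in t1 is <= the corresponding one in t2
    and at least one is strictly less."""
    keys = set(t1) | set(t2)
    all_le = all(t1.get(k, 0) <= t2.get(k, 0) for k in keys)
    some_lt = any(t1.get(k, 0) < t2.get(k, 0) for k in keys)
    return all_le and some_lt
```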
This discussion about logical clocks might feel quite abstract. Later in the book, we will encounter some practical applications of logical clocks. Once you learn to spot them, you will realize they are everywhere, as they can be disguised under different names. What's important to internalize at this point is that, generally, you can't use physical clocks to accurately derive the order of events that happened on different processes2.
In actual implementations a dictionary is used rather than an array.↩︎
That said, sometimes physical clocks are good enough. For example, using physical clocks to timestamp logs is fine as they are mostly used for debugging purposes.↩︎
Sometimes a single process in the system needs to have special powers, like being the only one that can access a shared resource or assign work to others. To grant a process these powers, the system needs to elect a leader among a set of candidate processes, which remains in charge until it crashes or becomes otherwise unavailable. When that happens, the remaining processes detect that the leader is no longer available and elect a new one.
A leader election algorithm needs to guarantee that there is at most one leader at any given time and that an election eventually completes. These two properties are also referred to as safety and liveness, respectively. This chapter explores how a specific algorithm, the Raft leader election algorithm, guarantees these properties.
Raft’s leader election algorithm is implemented with a state machine in which a process is in one of three states (see Figure 9.1):
In Raft, time is divided into election terms of arbitrary length. An election term is represented with a logical clock, a numerical counter that can only increase over time. A term begins with a new election, during which one or more candidates attempt to become the leader. The algorithm guarantees that for any term there is at most one leader. But what triggers an election in the first place?
When the system starts up, all processes begin their journey as followers. A follower expects to receive a periodic heartbeat from the leader containing the election term the leader was elected in. If the follower doesn’t receive any heartbeat within a certain time period, a timeout fires and the leader is presumed dead. At that point, the follower starts a new election by incrementing the current election term and transitioning to the candidate state. It then votes for itself and sends a request to all the processes in the system to vote for it, stamping the request with the current election term.
The process remains in the candidate state until one of three things happens: it wins the election, another process wins the election, or some time goes by with no winner:
Figure 9.1: Raft’s leader election algorithm represented as a state machine.
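The transitions in the figure can be sketched as a small state machine; vote requests over the network, log comparisons, and randomized election timeouts are deliberately left out, and all names are made up for the example:

```python
import enum

class State(enum.Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftElection:
    """Sketch of Raft's election state transitions for a single process."""

    def __init__(self, cluster_size: int):
        self.state = State.FOLLOWER
        self.term = 0
        self.votes = 0
        self.cluster_size = cluster_size

    def on_heartbeat_timeout(self) -> None:
        # No heartbeat from the leader: start a new election.
        self.term += 1
        self.state = State.CANDIDATE
        self.votes = 1  # the candidate votes for itself

    def on_vote_granted(self) -> None:
        self.votes += 1
        if self.votes > self.cluster_size // 2:
            self.state = State.LEADER  # majority reached

    def on_append_entries(self, leader_term: int) -> None:
        # A message from a leader with an equal or greater term wins.
        if leader_term >= self.term:
            self.term = leader_term
            self.state = State.FOLLOWER
```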
There are many more leader election algorithms out there than the one presented here, but Raft’s implementation is a modern take on the problem optimized for simplicity and understandability, which is why I chose it. That said, you will rarely need to implement leader election from scratch, as you can leverage linearizable key-value stores, like etcd or ZooKeeper, which offer abstractions that make it easy to implement leader election. The abstractions range from basic primitives like compare-and-swap to full-fledged distributed mutexes.
Ideally, the external store should at the very least offer an atomic compare-and-swap operation with an expiration time (TTL). The compare-and-swap operation updates the value of a key if and only if the value matches the expected one; the expiration time defines the time to live for a key, after which the key expires and is removed from the store if the lease hasn’t been extended. The idea is that each competing process tries to acquire a lease by creating a new key with compare-and-swap using a specific TTL. The first process to succeed becomes the leader and remains such until it stops renewing the lease, after which another process can become the leader.
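The acquisition flow can be sketched with an in-memory stand-in for such a store; `create_if_absent` is a hypothetical compare-and-swap-style primitive, and the key name and TTL are arbitrary choices for the example:

```python
import time

class InMemoryCASStore:
    """Stand-in for a linearizable store such as etcd or ZooKeeper."""

    def __init__(self):
        self.entries: dict[str, tuple[str, float]] = {}  # key -> (owner, expiry)

    def create_if_absent(self, key: str, owner: str, ttl_s: float) -> bool:
        """Create the key only if it's absent or its lease has expired."""
        now = time.monotonic()
        current = self.entries.get(key)
        if current is not None and current[1] > now:
            return False  # someone else holds an unexpired lease
        self.entries[key] = (owner, now + ttl_s)
        return True

def try_become_leader(store: InMemoryCASStore, process_id: str) -> bool:
    # The first process to create the key acquires the lease and leads
    # until it stops renewing it.
    return store.create_if_absent("leader-lease", process_id, ttl_s=10.0)
```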
The TTL expiry logic can also be implemented on the client-side, like this locking library for DynamoDB does, but the implementation is more complex, and it still requires the data store to offer a compare-and-swap operation.
You might think that’s enough to guarantee there can’t be more than one leader in your application. Unfortunately, that’s not the case.
To see why, suppose there are multiple processes that need to update a file on a shared blob store, and you want to guarantee that only a single process at a time can do so to avoid race conditions. To achieve that, you decide to use a distributed mutex, a form of leader election. Each process tries to acquire the lock, and the one that does so successfully reads the file, updates it in memory, and writes it back to the store:
if lock.acquire():
    try:
        content = store.read(blob_name)
        new_content = update(content)
        store.write(blob_name, new_content)
    finally:
        lock.release()
The problem here is that by the time the process writes the content to the store, it might no longer be the leader and a lot might have happened since it was elected. For example, the operating system might have preempted and stopped the process, and several seconds will have passed by the time it’s running again. So how can the process ensure that it’s still the leader then? It could check one more time before writing to the store, but that doesn’t eliminate the race condition, it just makes it less likely.
To avoid this issue, the data store downstream needs to verify that the request has been sent by the current leader. One way to do that is by using a fencing token. A fencing token is a number that increases every time that a distributed lock is acquired — in other words, it’s a logical clock. When the leader writes to the store, it passes down the fencing token to it. The store remembers the value of the last token and accepts only writes with a greater value:
success, token = lock.acquire()
if success:
    try:
        content = store.read(blob_name)
        new_content = update(content)
        store.write(blob_name, new_content, token)
    finally:
        lock.release()
This approach adds complexity as the downstream consumer, the blob store, needs to support fencing tokens. If it doesn’t, you are out of luck, and you will have to design your system around the fact that occasionally there will be more than one leader. For example, if there are momentarily two leaders and they both perform the same idempotent operation, no harm is done.
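The store-side check can be sketched as follows; `FencedBlobStore` is a made-up name, and a real store would persist the highest token durably rather than keep it in memory:

```python
class FencedBlobStore:
    """Rejects writes whose fencing token isn't greater than the last seen,
    so a stale leader's delayed write can't clobber newer data."""

    def __init__(self):
        self.blobs: dict[str, str] = {}
        self.highest_token = -1

    def write(self, blob_name: str, content: str, token: int) -> bool:
        if token <= self.highest_token:
            return False  # stale leader: reject the write
        self.highest_token = token
        self.blobs[blob_name] = content
        return True
```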
Although having a leader can simplify the design of a system as it eliminates concurrency, it can become a scaling bottleneck if the number of operations performed by the leader increases to the point where it can no longer keep up. When that happens, you might be forced to re-design the whole system.
Also, having a leader introduces a single point of failure with a large blast radius; if the election process stops working or the leader isn’t working as expected, it can bring down the entire system with it.
You can mitigate some of these downsides by introducing partitions and assigning a different leader per partition, but that comes with additional complexity. This is the solution many distributed data stores use.
Before considering the use of a leader, check whether there are other ways of achieving the desired functionality without it. For example, optimistic locking is one way to guarantee mutual exclusion at the cost of wasting some computing power. Or perhaps high availability is not a requirement for your application, in which case having just a single process that occasionally crashes and restarts is not a big deal.
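As a sketch of optimistic locking, each record carries a version number and a write succeeds only if nobody else wrote since the caller's read; the retries on conflict are the wasted computing power mentioned above. The class and method names are assumptions for the example:

```python
class VersionedStore:
    """Optimistic concurrency: writes must present the version they read."""

    def __init__(self):
        self.data: dict[str, tuple[str, int]] = {}  # key -> (value, version)

    def read(self, key: str) -> tuple[str, int]:
        return self.data.get(key, ("", 0))

    def write(self, key: str, value: str, expected_version: int) -> bool:
        _, version = self.data.get(key, ("", 0))
        if version != expected_version:
            return False  # someone else wrote in the meantime: re-read and retry
        self.data[key] = (value, version + 1)
        return True
```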
As a rule of thumb, if you must use leader election, you should minimize the work the leader performs and be prepared to occasionally have more than one leader if you can't support fencing tokens end-to-end.
Data replication is a fundamental building block of distributed systems. One reason to replicate data is to increase availability. If some data is stored exclusively on a single node, and that node goes down, the data won’t be accessible anymore. But if the data is replicated instead, clients can seamlessly switch to a replica. Another reason for replication is to increase scalability and performance; the more replicas there are, the more clients can access the data concurrently without hitting performance degradations.
Unfortunately, replicating data is not simple, as it's challenging to keep replicas consistent with one another. In this chapter, we will explore Raft's replication algorithm, which is one of the algorithms that provide the strongest consistency guarantee possible: the guarantee that, to the clients, the data appears to be located on a single node, even if it's actually replicated.
Raft is based on a technique known as state machine replication. The main idea behind it is that a single process, the leader, broadcasts the operations that change its state to other processes, the followers. If the followers execute the same sequence of operations as the leader, then the state of each follower will match the leader’s. Unfortunately, the leader can’t simply broadcast operations to the followers and call it a day, as any process can fail at any time, and the network can lose messages. This is why a large part of the algorithm is dedicated to fault-tolerance.
When the system starts up, a leader is elected using Raft’s leader election algorithm, which we discussed in chapter 9. The leader is the only process that can make changes to the replicated state. It does so by storing the sequence of operations that alter the state into a local ordered log, which it then replicates to the followers; it’s the replication of the log that allows the state to be replicated across processes.
As shown in Figure 10.1, a log is an ordered list of entries where each entry includes:
- the operation to be applied to the state;
- the index of the entry’s position in the log;
- the term number of the leader that created the entry.
Figure 10.1: The leader’s log is replicated to its followers. This figure appears in Raft’s paper.
When the leader wants to apply an operation to its local state, it first appends a new log entry for the operation into its log. At this point, the operation hasn’t been applied to the local state just yet; it has only been logged.
The leader then sends a so-called AppendEntries request to each follower with the new entry to be added. This message is also sent out periodically, even in the absence of new entries, as it acts as a heartbeat for the leader.
When a follower receives an AppendEntries request, it appends the entry it received to its log and sends back a response to the leader to acknowledge that the request was successful. When the leader hears back successfully from a majority of followers, it considers the entry to be committed and executes the operation on its local state.
The committed log entry is considered to be durable and will eventually be executed by all available followers. The leader keeps track of the highest committed index in the log, which is sent in all future AppendEntries requests. A follower only applies a log entry to its local state when it finds out that the leader has committed the entry.
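The commit rule described above can be sketched as a small helper: given the highest log index known to be replicated on each follower, the leader finds the largest index stored on a majority of the cluster. The function and argument names are illustrative, not Raft’s actual interface.

```python
def highest_committed_index(leader_last_index, follower_match_indexes):
    """Return the highest log index replicated on a majority (sketch).

    The cluster consists of the leader plus its followers;
    `follower_match_indexes` holds, per follower, the highest log
    index known to be replicated on that follower.
    """
    # The leader always stores its own entries, so include its last index.
    indexes = sorted(follower_match_indexes + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The entry at position `majority - 1` in the descending-sorted list
    # is stored on at least `majority` processes, so it is committed.
    return indexes[majority - 1]
```

For example, with a leader at index 7 and followers at indexes 7, 5, 3, and 2, a majority of the five processes (three of them) stores everything up to index 5, so index 5 is the commit point.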
Because the leader needs to wait only for a majority of followers, it can make progress even if some processes are down, i.e., if there are 2f + 1 processes, the system can tolerate up to f failures. The algorithm guarantees that an entry that is committed is durable and will eventually be executed by all the processes in the system, not just those that were part of the original majority.
So far, we have assumed there are no failures, and the network is reliable. Let’s relax these assumptions. If the leader fails, a follower is elected as the new leader. But, there is a caveat: because the replication algorithm only needs a majority of the processes to make progress, it’s possible that when a leader fails, some processes are not up-to-date.
To prevent an out-of-date process from becoming the leader, a process can’t vote for a candidate with a less up-to-date log. In other words, a process can’t win an election if it doesn’t contain all committed entries. To determine which of two processes’ logs is more up-to-date, the index and term of their last entries are compared. If the logs end with different terms, the log with the later term is more up-to-date. If the logs end with the same term, whichever log is longer is more up-to-date. Since the election requires a majority vote, and a candidate’s log must be at least as up-to-date as any other process in that majority to win the election, the elected process will contain all committed entries.
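The comparison rule can be expressed as a small predicate. This is a sketch of the voting rule described above with hypothetical parameter names; it is not taken from any particular Raft implementation.

```python
def candidate_log_up_to_date(candidate_last_term, candidate_last_index,
                             voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up-to-date as
    the voter's, so the voter may grant its vote (sketch)."""
    if candidate_last_term != voter_last_term:
        # Logs ending with different terms: the later term wins.
        return candidate_last_term > voter_last_term
    # Logs ending with the same term: the longer log wins (ties are fine).
    return candidate_last_index >= voter_last_index
```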
What if a follower fails? If an AppendEntries request can’t be delivered to one or more followers, the leader will retry sending it indefinitely until a majority of the followers successfully appended it to their logs. Retries are harmless as AppendEntries requests are idempotent, and followers ignore log entries that have already been appended to their logs.
So what happens when a follower that was temporarily unavailable comes back online? The resurrected follower will eventually receive an AppendEntries message with a log entry from the leader. The AppendEntries message includes the index and term number of the entry in the log that immediately precedes the one to be appended. If the follower can’t find a log entry with the same index and term number, it rejects the message, ensuring that an append to its log can’t create a hole. It’s as if the leader is sending a puzzle piece that the follower can’t fit in its version of the puzzle.
When the AppendEntries request is rejected, the leader retries sending the message, this time including the last two log entries — this is why we referred to the request as AppendEntries, and not as AppendEntry. This dance continues until the follower finally accepts a list of log entries that can be appended to its log without creating a hole. Although the number of messages exchanged can be optimized, the idea behind it is the same: the follower waits for a list of puzzle pieces that perfectly fit its version of the puzzle.
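The follower’s consistency check can be sketched as follows, modeling the log as a Python list of (term, operation) pairs where index 1 maps to the first element. The representation and function name are invented for illustration.

```python
def handle_append_entries(log, prev_index, prev_term, entries):
    """Follower-side handling of an AppendEntries request (sketch).

    `prev_index` and `prev_term` identify the entry that must
    immediately precede the new entries in the follower's log.
    """
    if prev_index > 0:
        # Reject if we don't have the preceding entry, or its term
        # differs: appending here would create a hole in the log or
        # diverge from the leader — the puzzle piece doesn't fit.
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    # Discard any conflicting suffix, then append the leader's entries.
    del log[prev_index:]
    log.extend(entries)
    return True
```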
State machine replication can be used for much more than just replicating data since it’s a solution to the consensus problem. Consensus is a fundamental problem studied in distributed systems research, which requires a set of processes to agree on a value in a fault-tolerant way so that:
- every non-faulty process eventually agrees on a value;
- the final decision of every non-faulty process is the same everywhere;
- the value that has been agreed on has been proposed by a process.
Consensus has a large number of practical applications. For example, a set of processes agreeing which one should hold a lock or commit a transaction are consensus problems in disguise. As it turns out, deciding on a value can be solved with state machine replication. Hence, any problem that requires consensus can be solved with state machine replication too.
Typically, when you have a problem that requires consensus, the last thing you want to do is to solve it from scratch by implementing an algorithm like Raft. While it’s important to understand what consensus is and how it can be solved, many good open-source projects implement state machine replication and expose simple APIs on top of it, like etcd and ZooKeeper.
Let’s take a closer look at what happens when a client sends a request to a replicated store. In an ideal world, the request executes instantaneously, as shown in Figure 10.2.
Figure 10.2: A write request executing instantaneously.
But in reality, things are quite different — the request needs to reach the leader, which then needs to process it and finally send back a response to the client. As shown in Figure 10.3, all these actions take time and are not instantaneous.
Figure 10.3: A write request can’t execute instantaneously because it takes time to reach the leader and be executed.
The best guarantee the system can provide is that the request executes somewhere between its invocation and completion time. You might think that this doesn’t look like a big deal; after all, it’s what you are used to when writing single-threaded applications. If you assign 1 to x and read its value right after, you expect to find 1 in there, assuming there is no other thread writing to the same variable. But, once you start dealing with systems that replicate their state on multiple nodes for high availability and scalability, all bets are off. To understand why that’s the case, we will explore different ways to implement reads in our replicated store.
In section 10.1, we looked at how Raft replicates the leader’s state to its followers. Since only the leader can make changes to the state, any operation that modifies it needs to necessarily go through the leader. But what about reads? They don’t necessarily have to go through the leader as they don’t affect the system’s state. Reads can be served by the leader, a follower, or a combination of leader and followers. If all reads were to go through the leader, the read throughput would be limited by that of a single process. But, if reads can be served by any follower instead, then two clients, or observers, can have a different view of the system’s state, since followers can lag behind the leader.
Intuitively, there is a trade-off between how consistent the observers’ views of the system are, and the system’s performance and availability. To understand this relationship, we need to define precisely what we mean by consistency. We will do so with the help of consistency models, which formally define the possible views of the system’s state observers can experience.
If clients send writes and reads exclusively to the leader, then every request appears to take place atomically at a very specific point in time as if there was a single copy of the data. No matter how many replicas there are or how far behind they are lagging, as long as the clients always query the leader directly, from their point of view there is a single copy of the data.
Because a request is not served instantaneously, and there is a single process serving it, the request executes somewhere between its invocation and completion time. Another way to think about it is that once a request completes, its side effects are visible to all observers, as shown in Figure 10.4.
Figure 10.4: The side-effects of a strongly consistent operation are visible to all observers once it completes.
Since a request becomes visible to all other participants between its invocation and completion time, there is a real-time guarantee that must be enforced; this guarantee is formalized by a consistency model called linearizability, or strong consistency. Linearizability is the strongest consistency guarantee a system can provide for single-object requests.
What if the client sends a read request to the leader, and by the time the request gets there, the server assumes it’s the leader, but it actually was just deposed? If the ex-leader were to process the request, the system would no longer be strongly consistent. To guard against this case, the presumed leader first needs to contact a majority of the replicas to confirm whether it still is the leader. Only then is it allowed to execute the request and send back the response to the client. This considerably increases the time required to serve a read.
So far, we have discussed serializing all reads through the leader. But doing so creates a single choke point, which limits the system’s throughput. On top of that, the leader needs to contact a majority of followers to handle a read, which increases the time it takes to process a request. To increase the read performance, we could allow the followers to handle requests as well.
Even though a follower can lag behind the leader, it will always receive new updates in the same order as the leader. Suppose a client only ever queries follower 1, and another only ever queries follower 2. In that case, the two clients will see the state evolving at different times, as followers are not entirely in sync (see Figure 10.5).
Figure 10.5: Although followers have a different view of the system’s state, they process updates in the same order.
The consistency model in which operations occur in the same order for all observers, but without any real-time guarantee about when an operation’s side effects become visible to them, is called sequential consistency. The lack of real-time guarantees is what differentiates sequential consistency from linearizability.
A producer/consumer system synchronized with a queue is an example of this model you might be familiar with; a producer process writes items to the queue, which a consumer reads. The producer and the consumer see the items in the same order, but the consumer lags behind the producer.
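A toy, in-process illustration of this model, using a plain in-memory deque as the queue (an assumption for the sketch, not a distributed queue):

```python
from collections import deque

queue = deque()
producer_view, consumer_view = [], []

# The producer writes items to the queue...
for item in ["a", "b", "c", "d"]:
    queue.append(item)
    producer_view.append(item)

# ...and the consumer drains it later, lagging behind the producer.
while queue:
    consumer_view.append(queue.popleft())

# Both observers see the items in the same order, even though the
# consumer observed them at a later time: sequential consistency.
assert consumer_view == producer_view
```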
Although we managed to increase the read throughput, we had to pin clients to followers — what if a follower goes down? We could increase the availability of the store by allowing a client to query any follower. But, this comes at a steep price in terms of consistency. Say there are two followers, 1 and 2, where follower 2 lags behind follower 1. If a client queries follower 1 and right after follower 2, it will see a state from the past, which can be very confusing. The only guarantee the client has is that eventually, all followers will converge to the final state if the writes to the system stop. This consistency model is called eventual consistency.
It’s challenging to build applications on top of an eventually consistent data store because its behavior differs from what you are used to when writing single-threaded applications. Subtle bugs that are hard to debug and reproduce can creep in. Yet, in eventual consistency’s defense, not all applications require linearizability. You need to make a conscious choice about whether the guarantees offered by your data store, or the lack thereof, satisfy your application’s requirements.
An eventually consistent store is perfectly fine if you want to keep track of the number of users visiting your website, as it doesn’t really matter if a read returns a number that is slightly out of date. But for a payment processor, you definitely want strong consistency.
When a network partition happens, parts of the system become disconnected from each other. For example, some clients might no longer be able to reach the leader. The system has two choices when this happens; it can either:
- remain available, giving up strong consistency, since some replicas can no longer receive updates; or
- preserve strong consistency, giving up availability, by failing requests that can’t be served consistently.
This concept is expressed by the CAP theorem, which can be summarized as: “strong consistency, availability and partition tolerance: pick two out of three.” In reality, the choice really is only between strong consistency and availability, as network faults are a given and can’t be avoided.
Even though network partitions can happen, they are usually rare. But, there is a trade-off between consistency and latency in the absence of a network partition. The stronger the consistency guarantee is, the higher the latency of individual operations must be. This relationship is expressed by the PACELC theorem. It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C).
To provide high availability and performance, off-the-shelf distributed data stores — sometimes referred to as NoSQL stores — come with counter-intuitive consistency guarantees. Others have knobs that allow you to choose whether you want better performance or stronger consistency guarantees, like Azure’s Cosmos DB and Cassandra. Because of that, you need to know what the trade-offs are. With what you have learned here, you will be in a much better place to understand why the trade-offs are there in the first place and what they mean for your application.
Transactions provide the illusion that a group of operations that modify some data has exclusive access to it and that either all operations complete successfully, or none does. You can typically leverage transactions to modify data owned by a single service, as it’s likely to reside in a single data store that supports transactions. On the other hand, transactions that update data owned by different services, each with its own data store, are challenging to implement. This chapter will explore how to add transactions to your application when your data model is partitioned.
Consider a money transfer from one bank account to another. If the withdrawal succeeds, but the deposit doesn’t, the funds need to be deposited back into the source account — money can’t just disappear into thin air. In other words, the transfer needs to execute atomically; either both the withdrawal and the deposit succeed, or neither do. To achieve that, the withdrawal and deposit need to be wrapped in an inseparable unit: a transaction.
In a traditional relational database, a transaction is a group of operations for which the database guarantees a set of properties, known as ACID:
- Atomicity: either all the operations in the transaction complete successfully, or none of them do.
- Consistency: the transaction takes the database from one valid state to another.
- Isolation: concurrent transactions appear to execute as if they ran one at a time.
- Durability: once a transaction commits, its changes are persisted and survive crashes.
Transactions relieve you from a whole range of possible failure scenarios so that you can focus on the actual application logic rather than all possible things that can go wrong. This chapter explores how distributed transactions differ from ACID transactions and how you can implement them in your systems. We will focus our attention mainly on atomicity and isolation.
A set of concurrently running transactions that access the same data can run into all sorts of race conditions, like dirty writes, dirty reads, fuzzy reads, and phantom reads:
- a dirty write happens when a transaction overwrites a value that was previously written by another transaction that hasn’t committed yet;
- a dirty read happens when a transaction reads a value written by another transaction that hasn’t committed yet;
- a fuzzy read happens when a transaction reads a value twice and sees different values because another transaction modified it in between;
- a phantom read happens when a transaction re-runs a query and sees a different set of rows because another transaction added or removed rows that match it.
To protect against these race conditions, a transaction needs to be isolated from others. An isolation level protects against one or more types of race conditions and provides an abstraction that we can use to reason about concurrency. The stronger the isolation level is, the more protection it offers against race conditions, but the less performant it is.
Transactions can have different types of isolation levels that are defined based on the type of race conditions they forbid, as shown in Figure 11.1.
Figure 11.1: Isolation levels define which race conditions they forbid.
Serializability is the only isolation level that guards against all possible race conditions. It guarantees that the side effects of executing a set of transactions appear to be the same as if they had executed sequentially, one after the other. But, we still have a problem — there are many possible orders that the transactions can appear to be executed in, as serializability doesn’t say anything about which one to pick.
Suppose we have two transactions A and B, and transaction B completes 5 minutes after transaction A. A system that guarantees serializability can reorder them so that B’s changes are applied before A’s. To add a real-time requirement on the order of transactions, we need a stronger isolation level: strict serializability. This level combines serializability with the real-time guarantees that linearizability provides so that when a transaction completes, its side effects become immediately visible to all future transactions.
(Strict) serializability is slow as it requires coordination, which creates contention in the system. As a result, there are many different isolation levels that are simpler to implement and also perform better. Your application might not need serializability, but you need to consciously decide which isolation level to use and understand its implications, or your data store will silently make the decision for you; for example, PostgreSQL’s default isolation is read committed. When in doubt, choose strict serializability.
There are more isolation levels and race conditions than the ones we discussed here. Jepsen provides a good formal reference of the existing isolation levels, how they relate to one another, and which guarantees they offer. Although vendors typically document the isolation levels their products offer, these specifications don’t always match the formal definitions.
Now that we know what serializability is, let’s look at how it can be implemented and why it’s so expensive in terms of performance. Serializability can be achieved either with a pessimistic or an optimistic concurrency control mechanism.
Pessimistic concurrency control uses locks to block other transactions from accessing a data item. The most popular pessimistic protocol is two-phase locking (2PL). 2PL has two types of locks, one for reads and one for writes. A read lock can be shared by multiple transactions that access the data item in read-only mode, but it blocks transactions trying to acquire a write lock. The latter can be held only by a single transaction and blocks anyone else trying to acquire either a read or write lock on the data item.
There are two phases in 2PL, an expanding and a shrinking one. In the expanding phase, the transaction is allowed only to acquire locks, but not to release them. In the shrinking phase, the transaction is permitted only to release locks, but not to acquire them. If these rules are obeyed, it can be formally proven that the protocol guarantees serializability.
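The two phases can be made concrete with a small guard class. This sketches only the phase rule — a real lock manager also implements the shared/exclusive lock table and blocking — and all names are made up:

```python
class TwoPhaseLockingTransaction:
    """Enforces the 2PL rule: once any lock is released (the shrinking
    phase begins), no further locks may be acquired."""

    def __init__(self):
        self.held = set()
        self.shrinking = False  # flips when the first lock is released

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after release")
        self.held.add(item)

    def release(self, item):
        self.shrinking = True  # the transaction enters the shrinking phase
        self.held.discard(item)
```

A transaction that releases a lock and then tries to acquire another one breaks the protocol, and the guard rejects it.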
The optimistic approach to concurrency control doesn’t block, as it checks for conflicts only at the very end of a transaction. If a conflict is detected, the transaction is aborted or restarted from the beginning. Generally, optimistic concurrency control is implemented with multi-version concurrency control (MVCC). With MVCC, the data store keeps multiple versions of a data item. Read-only transactions aren’t blocked by other transactions, as they can keep reading the version of the data that was committed at the time the transaction started. But, a transaction that writes to the store is aborted or restarted when a conflict is detected. While MVCC per se doesn’t guarantee serializability, there are variations of it that do, like Serializable Snapshot Isolation (SSI).
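The core of MVCC can be sketched with a store that keeps every committed version of a key, stamped with a logical timestamp; a reader supplies the timestamp of its snapshot and sees only versions committed at or before it. The class and its API are invented for illustration:

```python
class MVCCStore:
    """Minimal multi-version store sketch: each key maps to a list of
    (commit_timestamp, value) versions, appended in timestamp order."""

    def __init__(self):
        self.versions = {}  # key -> [(ts, value), ...]
        self.clock = 0      # logical clock issuing commit timestamps

    def write(self, key, value):
        """Commit a new version and return its timestamp."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot_ts):
        """Return the latest version committed at or before the snapshot."""
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None
```

A reader holding an old snapshot keeps seeing the old value even after a later write commits, which is why read-only transactions don’t block.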
Optimistic concurrency makes sense when you have read-heavy workloads that only occasionally perform writes, as reads don’t need to take any locks. For write-heavy loads, a pessimistic protocol is more efficient as it avoids retrying the same transactions repeatedly.
I have deliberately not spent much time describing 2PL and MVCC, as it’s unlikely you will have to implement them in your systems. But, the commercial data stores your systems depend on use one or the other technique to isolate transactions, so you must have a basic grasp of the tradeoffs.
Going back to our original example of sending money from one bank account to another, suppose the two accounts belong to two different banks that use separate data stores. How should we go about guaranteeing atomicity across the two accounts? We can’t just run two separate transactions to respectively withdraw and deposit the funds — if the second transaction fails, then the system is left in an inconsistent state. We need atomicity: the guarantee that either both transactions succeed and their changes are committed, or that they fail without any side effects.
Two-phase commit (2PC) is a protocol used to implement atomic transaction commits across multiple processes. The protocol is split into two phases, prepare and commit. It assumes a process acts as coordinator and orchestrates the actions of the other processes, called participants. Generally, the client application that initiates the transaction acts as the coordinator for the protocol.
When a coordinator wants to commit a transaction to the participants, it sends a prepare request asking the participants whether they are prepared to commit the transaction (see Figure 11.2). If all participants reply that they are ready to commit, the coordinator sends out a commit message to all participants ordering them to do so. In contrast, if any process replies that it’s unable to commit, or doesn’t respond promptly, the coordinator sends an abort request to all participants.
Figure 11.2: The two-phase commit protocol consists of a prepare and a commit phase.
There are two points of non-return in the protocol. If a participant replies to a prepare message that it’s ready to commit, it will have to do so later, no matter what. The participant can’t make progress from that point onward until it receives a message from the coordinator to either commit or abort the transaction. This means that if the coordinator crashes, the participant is completely blocked.
The other point of non-return is when the coordinator decides to commit or abort the transaction after receiving a response to its prepare message from all participants. Once the coordinator makes the decision, it can’t change its mind later and has to see the transaction through to being committed or aborted, no matter what. If a participant is temporarily down, the coordinator will keep retrying until the request eventually succeeds.
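The protocol’s happy path and abort path can be sketched as follows. The participant interface (prepare/commit/abort) is hypothetical, and the sketch leaves out timeouts, retries, and crash recovery:

```python
class FakeParticipant:
    """Stand-in for a real resource manager (for illustration only)."""

    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self):
        # Phase 1 vote: True means "I promise I can commit later".
        return self.can_commit

    def commit(self):
        self.outcome = "committed"

    def abort(self):
        self.outcome = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant whether it is prepared to commit.
    # all() stops at the first "no"; the abort in phase 2 still reaches
    # every participant.
    ready = all(p.prepare() for p in participants)
    # Phase 2: commit only if everyone voted yes; otherwise abort all.
    for p in participants:
        p.commit() if ready else p.abort()
    return ready
```

A single “no” vote (or, in a real system, a missing reply) forces the coordinator to abort the transaction on every participant.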
Two-phase commit has a mixed reputation. It’s slow, as it requires multiple round trips to complete a transaction, and it blocks when there is a failure. If either the coordinator or a participant fails, then all processes that are part of the transaction are blocked until the failing process comes back online. On top of that, the participants need to implement the protocol; you can’t just take PostgreSQL and Cassandra and expect them to play ball with each other.
If we are willing to increase complexity, there is a way to make the protocol more resilient to failures. Atomically committing a transaction is a form of consensus, called “uniform consensus,” where all the processes have to agree on a value, even the faulty ones. In contrast, the general form of consensus introduced in section 10.2 only guarantees that all non-faulty processes agree on the proposed value. Therefore, uniform consensus is actually harder than consensus.
Yet, an off-the-shelf consensus implementation can still be used to make 2PC more robust to failures. For example, replicating the coordinator with a consensus algorithm like Raft makes 2PC resilient to coordinator failures. Similarly, the participants could also be replicated.
As a historical side-note, the first versions of modern large-scale data stores that came out in the late 2000s used to be referred to as NoSQL stores as their core features were focused entirely on scalability and lacked the guarantees of traditional relational databases, such as ACID transactions. But in recent years, that has mostly changed as distributed data stores have continued to add features that only traditional databases offered, and ACID transactions have become the norm rather than the exception. For example, Google’s Spanner implements transactions across partitions using a combination of 2PC and state machine replication.
2PC is a synchronous blocking protocol; if any of the participants isn't available, the transaction can't make any progress, and the application blocks completely. The assumption is that the coordinator and the participants are available and that the transactions are short-lived. Because of its blocking nature, 2PC is generally combined with a blocking concurrency control mechanism, like 2PL, to provide isolation.
But, some types of transactions can take hours to execute, in which case blocking just isn’t an option. And some transactions don’t need isolation in the first place. Suppose we were to drop the isolation requirement and the assumption that the transactions are short-lived. Can we come up with an asynchronous non-blocking solution that still provides atomicity?
A typical pattern in modern applications is replicating the same data in different data stores tailored to different use cases, like search or analytics. Suppose we own a product catalog service backed by a relational database, and we decided to offer an advanced full-text search capability in its API. Although some relational databases offer basic full-text search functionality, a dedicated database such as Elasticsearch is required for more advanced use cases.
To integrate with the search index, the catalog service needs to update both the relational database and the search index when a new product is added or an existing product is modified or deleted. The service could just update the relational database first, and then the search index; but if the service crashes before updating the search index, the system would be left in an inconsistent state. As you can guess by now, we need to wrap the two updates into a transaction somehow.
We could consider using 2PC, but while the relational database supports the X/Open XA 2PC standard, the search index doesn’t, which means we would have to implement that functionality from scratch. We also don’t want the catalog service to block if the search index is temporarily unavailable. Although we want the two data stores to be in sync, we can accept some temporary inconsistencies. In other words, eventual consistency is acceptable for our use case.
To solve this problem, let's introduce a message log in our application. A log is an append-only, totally ordered sequence of messages, in which each message is assigned a unique sequential index. Messages are appended at the end of the log, and consumers read from it in order. Kafka and Azure Event Hubs are two popular implementations of logs.
Now, when the catalog service receives a request from a client to create a new product, rather than writing to the relational database, or the search index, it appends a product creation message to the message log. The append acts as the atomic commit step for the transaction. The relational database and the search index are asynchronous consumers of the message log, reading entries in the same order as they were appended and updating their state at their own pace (see Figure 11.3). Because the message log is ordered, it guarantees that the consumers see the entries in the same order.
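To make the mechanics concrete, here is a toy in-memory sketch of the append-as-commit step (all names are made up for the illustration; a production system would use a log implementation like Kafka or Azure Event Hubs rather than anything hand-rolled):

```python
class MessageLog:
    """A toy append-only, totally ordered message log (in-memory sketch)."""

    def __init__(self):
        self._entries = []  # a message's index is its position in the list

    def append(self, message):
        # The append is the atomic commit step: once it returns, the message
        # is in the log and will eventually reach every consumer.
        self._entries.append(message)
        return len(self._entries) - 1  # the message's unique sequential index

    def read_from(self, index):
        # Consumers read entries in order, each starting at its own offset.
        return self._entries[index:]


log = MessageLog()
log.append({"type": "product-created", "id": "p1", "name": "Laptop"})
log.append({"type": "product-updated", "id": "p1", "price": 999})

# Both consumers see the same entries in the same order, at their own pace.
db_view = log.read_from(0)      # the relational database consumer
search_view = log.read_from(0)  # the search index consumer
assert db_view == search_view
```

Because the total order is fixed at append time, the relational database and the search index converge to the same state no matter how far behind either of them falls.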
Figure 11.3: The producer appends entries at the end of the log, while the consumers read the entries at their own pace.
The consumers periodically checkpoint the index of the last message they processed. If a consumer crashes and comes back online after some time, it reads the last checkpoint and resumes reading messages from where it left off. Doing so ensures there is no data loss even if the consumer was offline for some time.
But, there is a problem as the consumer can potentially read the same message multiple times. For example, the consumer could process a message and crash before checkpointing its state. When it comes back online, it will eventually re-read the same message. Therefore, messages need to be idempotent so that no matter how many times they are read, the effect should be the same as if they had been processed only once. One way to do that is to decorate each message with a unique ID and ignore messages with duplicate IDs at read time.
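A minimal sketch of such a consumer, combining offset checkpointing with ID-based de-duplication, could look like this (the names and structure are assumptions for the illustration, not a prescribed design):

```python
class IdempotentConsumer:
    """Sketch of a log consumer that checkpoints its position and ignores
    messages with duplicate IDs, so re-reads after a crash have no effect."""

    def __init__(self):
        self.checkpoint = 0    # index of the next entry to process
        self.seen_ids = set()  # IDs of messages already applied
        self.state = []        # the consumer's local state

    def process(self, entries):
        # `entries` are the log entries starting at self.checkpoint.
        for offset, message in enumerate(entries, start=self.checkpoint):
            if message["id"] not in self.seen_ids:  # skip duplicates
                self.state.append(message["payload"])
                self.seen_ids.add(message["id"])
            # In a real system the checkpoint would be persisted durably.
            self.checkpoint = offset + 1


consumer = IdempotentConsumer()
entries = [
    {"id": "m1", "payload": "create p1"},
    {"id": "m2", "payload": "update p1"},
]
consumer.process(entries)

# Simulate a crash that lost the checkpoint: the consumer re-reads the same
# entries, but de-duplication keeps the effect identical to processing once.
consumer.checkpoint = 0
consumer.process(entries)
assert consumer.state == ["create p1", "update p1"]
```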
We have already encountered the log abstraction in chapter 10 when discussing state machine replication. If you squint a little, you will see that what we have just implemented here is a form of state machine replication, where the state is represented by all products in the catalog, and the replication happens across the relational database and the search index.
Message logs are part of a more general communication interaction style referred to as messaging. In this model, the sender and the receiver don’t communicate directly with each other; they exchange messages through a channel that acts as a broker. The sender sends messages to the channel, and on the other side, the receiver reads messages from it.
A message channel acts as a temporary buffer for the receiver. Unlike the direct request-response communication style we have been using so far, messaging is inherently asynchronous as sending a message doesn’t require the receiving service to be online.
A message has a well-defined format, consisting of a header and a body. The message header contains metadata, such as a unique message ID, while its body contains the actual content. Typically, a message can either be a command, which specifies an operation to be invoked by the receiver, or an event, which signals the receiver that something of interest happened in the sender.
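As an illustration, a message could be modeled like this (the field names are assumptions of this sketch, not a standard wire format):

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Message:
    """A message: a header with metadata and a body with the content."""
    kind: str   # "command" or "event"
    body: dict  # the actual content
    # Header metadata: a unique message ID, useful for de-duplication.
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# A command specifies an operation the receiver should invoke...
cmd = Message(kind="command", body={"op": "create-product", "name": "Laptop"})
# ...while an event signals that something of interest happened in the sender.
evt = Message(kind="event", body={"what": "product-created", "id": "p1"})
assert cmd.message_id != evt.message_id  # unique IDs enable de-duplication
```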
Services use inbound adapters to receive messages from messaging channels, which are part of their API surface, and outbound adapters to send messages, as shown in Figure 11.4. The log abstraction we have used earlier is just one form of messaging channel. Later in the book, we will encounter other types of channels, like queues, that don’t guarantee any ordering of the messages.
Figure 11.4: Inbound messaging adapters are part of a service’s API surface.
Suppose we own a travel booking service. To book a trip, the travel service has to atomically book a flight through a dedicated service and a hotel through another. However, either of these services can fail their respective requests. If one booking succeeds, but the other fails, then the former needs to be canceled to guarantee atomicity. Hence, booking a trip requires multiple steps to complete, some of which are only required in case of failure. Since appending a single message to a log is no longer sufficient to commit the transaction, we can’t use the simple log-oriented solution presented earlier.
The Saga pattern provides a solution to this problem. A saga is a distributed transaction composed of a set of local transactions T1, T2, ..., Tn, where each Ti has a corresponding compensating local transaction Ci used to undo its changes. The saga guarantees that either all local transactions succeed, or, in case of failure, that the compensating local transactions undo the partial execution of the transaction altogether. This guarantees the atomicity of the protocol: either all local transactions succeed, or none of them do. A saga can be implemented with an orchestrator, the transaction's coordinator, that manages the execution of the local transactions across the processes involved, the transaction's participants.
In our example, the travel booking service is the transaction's coordinator, while the flight and hotel booking services are the transaction's participants. The saga is composed of three local transactions: T1 books a flight, T2 books a hotel, and the compensating transaction C1 cancels the flight booked by T1.
At a high level, the Saga can be implemented with the workflow depicted in Figure 11.5:
The coordinator can communicate asynchronously with the participants via message channels to tolerate temporarily unavailable ones. As the transaction requires multiple steps to succeed, and the coordinator can fail at any time, it needs to persist the state of the transaction as it advances. By modeling the transaction as a state machine, the coordinator can durably checkpoint its state to a database as it transitions from one state to the next. This ensures that if the coordinator crashes and restarts, or another process is elected as the coordinator, it can resume the transaction from where it left off by reading the last checkpoint.
Figure 11.5: A workflow implementing an asynchronous transaction.
There is a caveat, though; if the coordinator crashes after sending a request but before backing up its state, it will have to re-send the request when it comes back online. Similarly, if sending a request times out, the coordinator will have to retry it, causing the message to appear twice at the receiving end. Hence, the participants have to de-duplicate the messages they receive to make their processing idempotent.
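The saga's happy path and compensation path can be sketched as follows; this synchronous, in-process version deliberately glosses over the message channels, state checkpointing, and de-duplication just described, and all names are hypothetical:

```python
class BookingError(Exception):
    """Raised by a participant when its local transaction fails."""


def book_trip(book_flight, book_hotel, cancel_flight):
    """Minimal saga sketch: T1 = book_flight, T2 = book_hotel,
    and C1 = cancel_flight compensates T1 if T2 fails."""
    flight = book_flight()        # local transaction T1
    try:
        hotel = book_hotel()      # local transaction T2
    except BookingError:
        cancel_flight(flight)     # compensating transaction C1
        raise                     # the saga fails atomically: no partial state
    return flight, hotel


# Happy path: both local transactions succeed.
assert book_trip(lambda: "F1", lambda: "H1", lambda f: None) == ("F1", "H1")

# Failure path: the hotel booking fails, so the flight is canceled.
canceled = []
def failing_hotel():
    raise BookingError("no rooms available")

try:
    book_trip(lambda: "F1", failing_hotel, canceled.append)
except BookingError:
    pass
assert canceled == ["F1"]  # the compensating transaction ran
```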
In practice, you don’t need to build orchestration engines from scratch to implement such workflows. Serverless cloud compute services such as AWS Step Functions or Azure Durable Functions make it easy to create fully-managed workflows.
We started our journey into asynchronous transactions as a way to design around the blocking nature of 2PC. To get here, we had to sacrifice the isolation guarantee that traditional ACID transactions provide. As it turns out, we can work around the lack of isolation as well. For example, one way to do that is with the use of semantic locks. The idea is that any data the Saga modifies is marked with a dirty flag. This flag is only cleared at the end of the transaction when it completes. Another transaction trying to access a dirty record can either fail and roll back its changes, or block until the dirty flag is cleared. The latter approach can introduce deadlocks, though, which requires a strategy to mitigate them.
Now that we understand how to coordinate processes, we are ready to dive into one of the main use cases for building distributed systems: scalability.
A scalable application can increase its capacity as its load increases. The simplest way to do that is by scaling up, i.e., running the application on more expensive hardware, but that only gets you so far since the application will eventually reach a performance ceiling.
The alternative to scaling up is scaling out by distributing the load over multiple nodes. This part explores three categories, or dimensions, of scalability patterns: functional decomposition, partitioning, and duplication. The beauty of these dimensions is that they are independent of each other and can be combined within the same application.
Functional decomposition
Functional decomposition is the process of taking an application and breaking it down into individual parts. Think of the last time you wrote some code; you most likely decomposed it into functions, classes, and modules. The same idea can be taken further by decomposing an application into separate services, each with its own well-defined responsibility.
Section 12.1 discusses the advantages and pitfalls of splitting an application into a set of independently deployable services.
Section 12.2 describes how external clients can communicate with an application after it has been decomposed into services using an API gateway. The gateway acts as the proxy for the application by routing, composing, and translating requests.
Section 12.3 discusses how to decouple an API’s read path from its write path so that their respective implementations can use different technologies that fit their specific use cases.
Section 12.4 dives into asynchronous messaging channels that decouple producers on one end of a channel from consumers on the other end. Thanks to channels, communication between two parties is possible even if the destination is temporarily not available. Messaging provides several other benefits, which we will explore in this section, along with best practices and pitfalls you can run into.
Partitioning
When a dataset no longer fits on a single node, it needs to be partitioned across multiple nodes. Partitioning is a general technique that can be used in a variety of circumstances, like sharding TCP connections across backends in a load balancer.
We will explore different sharding strategies in section 13.1, such as range and hash partitioning. Then, in section 13.2, we will discuss how to rebalance partitions either statically or dynamically.
Duplication
The easiest way to add more capacity to a service is to create more instances of it and have some way of routing, or balancing, requests to them. This can be a fast and cheap way to scale out a stateless service, as long as you have considered the impact on the service’s dependencies. Scaling out a stateful service is significantly more challenging as some form of coordination is required.
Section 14.1 introduces the concept of load balancing requests across nodes and its implementation using commodity machines. We will start with DNS load balancing and then dive into the implementation of load balancers that operate at the transport and application layer of the network stack. Finally, we will discuss geo load balancing that allows clients to communicate with the geographically closest datacenter.
Section 14.2 describes how to replicate data across nodes and keep it in sync. Although we have already discussed one way of doing that with Raft in chapter 10, in this section, we will take a broader look at the topic and explore different approaches with varying trade-offs (single-leader, multi-leader, and leaderless).
Section 14.3 discusses the benefits and pitfalls of caching. We will start with in-process caches, which are easy to implement but have several pitfalls. Then, we will look at the pros and cons of external caches.
An application typically starts its life as a monolith. Take a modern backend of a single-page JavaScript application (SPA), for example. It might start out as a single stateless web service that exposes a RESTful HTTP API and uses a relational database as a backing store. The service is likely to be composed of a number of components or libraries that implement different business capabilities, as shown in Figure 12.1.
Figure 12.1: A monolithic backend composed of multiple components.
As the number of feature teams contributing to the same codebase increases, the components become increasingly coupled over time. This leads the teams to step on each other’s toes more and more frequently, decreasing their productivity.
The codebase becomes complex enough that nobody fully understands every part of it, and implementing new features or fixing bugs becomes time-consuming. Even if the backend is componentized into different libraries owned by different teams, a change to a library requires the service to be redeployed. And if a change introduces a bug like a memory leak, the entire service can potentially be affected by it. Additionally, rolling back a faulty build affects the velocity of all teams, not just the one that introduced the bug.
One way to mitigate the growing pains of a monolithic backend is to split it into a set of independently deployable services that communicate via APIs, as shown in Figure 12.2. The APIs decouple the services from each other by creating boundaries that are hard to violate, unlike the ones between components running in the same process.
Figure 12.2: A backend split into independently deployable services that communicate via APIs.
This architectural style is also referred to as the microservice architecture. The term micro can be misleading, though: there doesn't have to be anything micro about the services. In fact, I would argue that if a service doesn't do much, it just creates more operational overhead than benefits. A more appropriate name for this architecture is service-oriented architecture, but unfortunately, that name comes with some old baggage as well. Perhaps in 10 years, we will call the same concept by yet another name, but for now we will have to stick with microservices.
Breaking down the backend by business capabilities into a set of services with well-defined boundaries allows each service to be developed and operated by a single small team. Smaller teams can increase the application's development speed for a variety of reasons.
The microservice architecture adds more moving parts to the overall system, and this doesn’t come for free. The cost of fully embracing microservices is only worth paying if it can be amortized across dozens of development teams.
Development experience
Nothing forbids the use of different languages, libraries, and datastores in each microservice, but doing so transforms the application into an unmaintainable mess. For example, it becomes more challenging for a developer to move from one team to another if the software stack is completely different. And think of the sheer number of libraries, one for each language adopted, that need to be supported to provide common functionality that all services need, like logging.
It’s only reasonable then that a certain degree of standardization is needed. One way to do that, while still allowing some degree of freedom, is to loosely encourage specific technologies by providing a great development experience for the teams that stick with the recommended portfolio of languages and technologies.
Resource provisioning
To support a large number of independent services, it should be simple to spin up new machines, data stores, and other commodity resources — you don’t want every team to come up with their own way of doing it. And once these resources have been provisioned, they have to be configured. To be able to pull this off, you will need a fair amount of automation.
Communication
Remote calls are expensive and come with all the caveats we discussed earlier in the book. You will need defense mechanisms to protect against failures and leverage asynchrony and batching to mitigate the performance hit of communicating across the network. All of this increases the system’s complexity.
Much of what is described in this book is about dealing with this complexity, and as it should be clear by now, it doesn’t come cheap. That being said, even a monolith doesn’t live in isolation since it’s being accessed by remote clients, and it’s likely to use third-party APIs as well. So eventually, these issues need to be solved there as well, albeit on a smaller scale.
Continuous integration, delivery, and deployment
Continuous integration ensures that code changes are merged into the main branch after an automated build and test suites have run. Once a code change has been merged, it should be automatically published and deployed to a production-like environment, where a battery of integration and end-to-end tests run to ensure that the service doesn’t break any dependencies or use cases.
While testing individual microservices is not more challenging than testing a monolith, testing the integration of all the microservices is an order of magnitude harder. Very subtle and unexpected behavior can emerge when individual services interact with each other.
Operations
Unlike with a monolith, it’s much more expensive to staff each team responsible for a service with its own operations team. As a result, the team that develops a service is typically also on-call for it. This creates friction between adding new features and operating the service as the team needs to decide what to prioritize during each sprint.
Debugging system failures becomes more challenging as well, as you can't just load the whole application on your local machine and step through it with a debugger. And since there are more moving parts, the system has more ways to fail. This is why good logging and monitoring become crucial.
Eventual consistency
A side effect of splitting an application into separate services is that the data model no longer resides in a single data store. As we have learned in previous chapters, atomically updating records stored in different data stores, and guaranteeing strong consistency, is slow, expensive, and hard to get right. Hence, this type of architecture usually requires embracing eventual consistency.
Splitting an application into services adds a lot of complexity to the overall system. Because of that, it’s generally best to start with a monolith and split it up only when there is a good reason to do so.
Getting the boundaries right between the services is challenging — it’s much easier to move them around within a monolith until you find a sweet spot. Once the monolith is well matured and growing pains start to rise, then you can start to peel off one microservice at a time from it.
You should only start with a microservice-first approach if you already have experience with it, and you either have built out a platform for it or have accounted for the time it will take you to build one.
After you have split an application into a set of services, each with its own API, you need to rethink how clients communicate with the application. A client might need to perform multiple requests to different services to fetch all the information it needs to complete a specific operation. This can be very expensive on mobile devices where every network request consumes precious battery life.
Moreover, clients need to be aware of implementation details, like the DNS names of all the internal services. This makes it challenging to change the application’s architecture as it could require all clients to be upgraded. To make matters worse, if clients are distributed to individual consumers (e.g., an app on the App Store), there might not be an easy way to force them all to upgrade to a new version. The bottom line is that once a public API is out there, you better be prepared to maintain it for a very long time.
As is typical in computer science, we can solve this problem by adding a layer of indirection. The internal APIs can be hidden by a public one that acts as a facade, or proxy, for the internal services (see Figure 12.3). The service that exposes the public API is called the API gateway, which is transparent to its clients since they have no idea they are communicating through an intermediary.
Figure 12.3: The API gateway hides the internal APIs from its clients.
The API gateway provides multiple features, like routing, composition, and translation.
The API gateway can route the requests it receives to the appropriate backend service. It does so with the help of a routing map, which maps the external APIs to the internal ones. For example, the map might have a 1:1 mapping between an external path and an internal one. If the internal path changes in the future, the public API can continue to expose the old path to guarantee backward compatibility.
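A routing map can be as simple as a lookup table; the paths and service names below are made up for the sketch:

```python
# Hypothetical routing map: external paths on the left, internal endpoints
# on the right. Only the map changes when a service moves or is renamed.
ROUTES = {
    "/api/v1/products": "http://catalog-service/products",
    "/api/v1/search":   "http://search-service/query",
}


def route(external_path):
    """Resolve an external path to the internal service endpoint."""
    try:
        return ROUTES[external_path]
    except KeyError:
        raise LookupError(f"no route for {external_path}")


# If the internal path changes later, the external path stays the same,
# preserving backward compatibility for clients.
assert route("/api/v1/products") == "http://catalog-service/products"
```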
While data of a monolithic application typically resides in a single data store, in a distributed system, it’s spread across multiple services. As such, some use cases might require stitching data back together from multiple sources. The API gateway can offer a higher-level API that queries multiple services and composes their responses within a single one that is then returned to the client. This relieves the client from knowing which services to query and reduces the number of requests it needs to perform to get the data it needs.
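A composition endpoint might look like the following sketch, where the fetcher functions stand in for calls to hypothetical internal services:

```python
def get_product_page(product_id, fetch_product, fetch_reviews, fetch_stock):
    """Sketch of a composed endpoint: the gateway fans out to several
    internal services and stitches their responses into a single one."""
    return {
        "product": fetch_product(product_id),
        "reviews": fetch_reviews(product_id),
        "stock":   fetch_stock(product_id),
    }


# The client issues one request; the gateway performs three internal calls.
page = get_product_page(
    "p1",
    fetch_product=lambda pid: {"id": pid, "name": "Laptop"},
    fetch_reviews=lambda pid: [{"rating": 5}],
    fetch_stock=lambda pid: 3,
)
assert page["product"]["name"] == "Laptop"
```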
Composition can be hard to get right. The availability of the composed API decreases as the number of internal calls increases since each has a non-zero probability of failure. Additionally, the data across the services might be inconsistent as some updates might not have propagated to all services yet; in that case, the gateway will have to somehow resolve this discrepancy.
The API gateway can translate from one IPC mechanism to another. For example, it can translate a RESTful HTTP request into an internal gRPC call.
The gateway can also expose different APIs to different types of clients. For example, a web API for a desktop application can potentially return more data than the one for a mobile application, as the screen real estate is larger and more information can be presented at once. Also, network calls are expensive for mobile clients, and requests generally need to be batched to reduce battery usage.
To meet these different and competing requirements, the gateway can provide different APIs tailored to different use cases and translate these APIs to the internal ones. An increasingly popular approach to tailor APIs to individual use cases is to use graph-based APIs. A graph-based API exposes a schema composed of types, fields, and relationships across types. The API allows a client to declare what data it needs and let the gateway figure out how to translate the request into a series of internal API calls.
This approach reduces the development time as there is no need to introduce different APIs for different use cases, and the clients are free to specify what they need. There is still an API, though; it just happens that it’s described with a graph schema. In a way, it’s as if the gateway grants the clients the ability to perform restricted queries on its backend APIs. GraphQL is the most popular technology in the space at the time of writing.
由于 API 网关是其背后服务的代理或中间人,因此它还可以实现横切功能,否则必须在每个服务中重新实现这些功能。例如,API网关可以缓存经常访问的资源以提高API的性能,同时减少对服务的带宽要求或速率限制请求以保护服务不被淹没。
As the API gateway is a proxy, or middleman, for the services behind it, it can also implement cross-cutting functionality that otherwise would have to be re-implemented in each service. For example, the API gateway could cache frequently accessed resources to improve the API’s performance while reducing the bandwidth requirements on the services, or rate-limit requests to protect the services from being overwhelmed.
在确保服务安全的最关键的交叉方面中,身份验证和授权是首要考虑的。身份验证是验证从客户端发出请求的所谓主体(人或应用程序)是否真实的过程。相反,授权是授予经过身份验证的主体权限以执行特定操作(例如创建、读取、更新或删除特定资源)的过程。通常,这是通过向主体分配一个或多个授予特定权限的角色来实现的。或者,可以使用访问控制列表来授予特定主体对特定资源的访问权限。
Among the most critical cross-cutting aspects of securing a service, authentication and authorization are top-of-mind. Authentication is the process of validating that a so-called principal — a human or an application — issuing a request from a client is who it says it is. Authorization instead is the process of granting the authenticated principal permissions to perform specific operations, like creating, reading, updating, or deleting a particular resource. Typically this is implemented by assigning a principal one or more roles that grant specific permissions. Alternatively, an access control list can be used to grant specific principals access to specific resources.
单体应用程序可以使用会话令牌实现身份验证和授权。客户端将其凭据发送到应用程序 API 的登录端点,该端点将验证凭据。如果成功,端点通常通过 HTTP cookie 将会话令牌1返回给客户端。然后,客户端将令牌包含在所有未来的请求中。
A monolithic application can implement authentication and authorization with session tokens. A client sends its credentials to the application API’s login endpoint, which validates the credentials. If that’s successful, the endpoint returns a session token1 to the client, typically through an HTTP cookie. The client then includes the token in all future requests.
应用程序使用会话令牌从内存缓存或外部数据存储中检索会话对象。该对象包含主体的 ID 和授予它的角色,应用程序的 API 处理程序使用它们来决定是否允许主体执行操作。
The application uses the session token to retrieve a session object from an in-memory cache or an external data store. The object contains the principal’s ID and the roles granted to it, which are used by the application’s API handlers to decide whether to allow the principal to perform an operation or not.
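A minimal sketch of this flow, with an in-memory session store and a made-up credential check standing in for a real user database:

```python
import secrets

# Sketch of session-token authentication in a monolith. The credential
# store and the roles below are stand-ins, not a real user database.
SESSIONS = {}                        # session token -> session object
USERS = {"ada": "correct-password"}  # toy credential store

def login(username, password):
    """Login endpoint: validate credentials and issue a session token."""
    if USERS.get(username) != password:
        raise PermissionError("invalid credentials")
    token = secrets.token_hex(16)    # cryptographically-strong random token
    SESSIONS[token] = {"principal": username, "roles": {"reader"}}
    return token

def handle_delete(token):
    """An API handler that requires the 'admin' role."""
    session = SESSIONS.get(token)    # retrieve the session object
    if session is None:
        raise PermissionError("not authenticated")
    if "admin" not in session["roles"]:
        raise PermissionError("not authorized")
    return "deleted"

token = login("ada", "correct-password")
```

In production the session store would be an external cache or data store, so any application instance can validate a token.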
将这种方法转化为微服务架构并不是那么简单。例如,哪个服务应该负责验证和授权请求并不明显,因为请求的处理可以跨越多个服务。
Translating this approach to a microservice architecture is not that straightforward. For example, it’s not obvious which service should be responsible for authenticating and authorizing requests, as the handling of requests can span multiple services.
一种方法是让 API 网关负责验证外部请求,因为这是它们的入口点。这允许将支持不同身份验证机制的逻辑集中到单个组件中,从而隐藏内部服务的复杂性。相反,授权请求最好留给各个服务,以避免 API 网关与其域逻辑耦合。
One approach is to have the API gateway be responsible for authenticating external requests, since that’s their point of entry. This allows centralizing the logic to support different authentication mechanisms into a single component, hiding the complexity from internal services. In contrast, authorizing requests is best left to individual services to avoid coupling the API gateway with their domain logic.
当 API 网关对请求进行身份验证后,它会创建一个安全令牌。网关将此令牌传递给负责处理请求的内部服务,后者又将其传递给下游的依赖项(参见图12.4)。
When the API gateway has authenticated a request, it creates a security token. The gateway passes this token to the internal services responsible for handling the request, which in turn will pass it downstream to their dependencies (see Figure 12.4).
图 12.4:API 客户端将带有凭据的请求发送到 API 网关;API 网关尝试使用身份验证服务验证凭据;身份验证服务验证凭据并回复安全令牌;API 网关向服务 A 发送包含安全令牌的请求;API 网关向服务 B 发送包含安全令牌的请求;API 网关组合 A 和 B 的响应并回复客户端。
Figure 12.4: The API client sends a request with credentials to the API gateway; the gateway tries to authenticate the credentials with the auth service; the auth service validates the credentials and replies with a security token; the gateway sends a request including the security token to service A, and another to service B; finally, it composes the responses from A and B and replies to the client.
当内部服务收到附有安全令牌的请求时,它需要有一种方法来验证该请求并获取主体的身份及其角色。验证根据所使用的令牌类型而有所不同,令牌可以是不透明的且不包含任何信息(例如,UUID),也可以是透明的并将委托人的信息嵌入到令牌本身中。
When an internal service receives a request with a security token attached to it, it needs to have a way to validate it and obtain the principal’s identity and its roles. The validation differs depending on the type of token used, which can be either opaque and not contain any information (e.g., a UUID), or transparent and embed the principal’s information within the token itself.
不透明令牌的缺点是它们需要服务调用外部身份验证服务来验证令牌并检索主体的信息。透明代币消除了这种调用,但代价是使撤销已发行的落入坏人之手的代币变得更加困难。
The downside of opaque tokens is that they require services to call an external auth service to validate a token and retrieve the principal’s information. Transparent tokens eliminate that call at the expense of making it harder to revoke issued tokens that have fallen into the wrong hands.
最流行的透明令牌标准是JSON Web 令牌(JWT)。JWT 是一个 JSON 有效负载,其中包含到期日期、主体身份、角色和其他元数据。有效负载使用内部服务信任的证书进行签名。因此,不需要外部调用来验证令牌。
The most popular standard for transparent tokens is the JSON Web Token (JWT). A JWT is a JSON payload that contains an expiration date, the principal’s identity, roles, and other metadata. The payload is signed with a certificate trusted by internal services. Hence, no external calls are needed to validate the token.
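The following sketch signs and verifies a JWT-like token using an HMAC with a shared secret rather than a certificate, for brevity; real JWTs also carry a header segment, and a real deployment would use a standard JWT library and proper key management:

```python
import base64, hashlib, hmac, json, time

# Illustrative JWT-like token signed with HMAC-SHA256. This sketch keeps
# only the core idea: a signed payload that can be validated locally.
SECRET = b"shared-secret"  # stand-in for key material trusted by services

def b64(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def issue_token(principal, roles, ttl=3600):
    payload = {"sub": principal, "roles": roles, "exp": int(time.time()) + ttl}
    body = b64(json.dumps(payload).encode())
    signature = b64(hmac.new(SECRET, body, hashlib.sha256).digest())
    return body + b"." + signature

def verify_token(token):
    body, signature = token.rsplit(b".", 1)
    expected = b64(hmac.new(SECRET, body, hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    padded = body + b"=" * (-len(body) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    if payload["exp"] < time.time():
        raise ValueError("expired")
    return payload  # validated locally, with no external call

claims = verify_token(issue_token("ada", ["reader"]))
```

The verification step is what makes transparent tokens attractive: no round-trip to an auth service is needed on every request.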
OpenID Connect和OAuth 2是可用于实现基于令牌的身份验证和授权的安全协议。我们仅仅触及了这个主题的表面,并且您可以阅读有关该主题的整本书以了解更多信息。
OpenID Connect and OAuth 2 are security protocols that you can use to implement token-based authentication and authorization. We have barely scratched the surface on the topic, and there are entire books written on the subject you can read to learn more about it.
另一种广泛使用的应用程序身份验证机制是使用 API 密钥。API 密钥是一个自定义密钥,允许 API 网关识别哪个应用程序正在发出请求并限制它们可以执行的操作。这种方法在公共 API 中很流行,例如 Github 或 Twitter 提供的 API。
Another widespread mechanism to authenticate applications is the use of API keys. An API key is a custom key that allows the API gateway to identify which application is making a request and limit what it can do. This approach is popular for public APIs, like those offered by GitHub or Twitter.
使用 API 网关的缺点之一是它可能成为开发瓶颈。由于它与其隐藏的服务相结合,因此创建的每个新服务都需要连接到它。另外,每当服务的API发生变化时,网关也需要修改。
One of the drawbacks of using an API gateway is that it can become a development bottleneck. As it’s coupled with the services it’s hiding, every new service that is created needs to be wired up to it. Additionally, whenever the API of a service changes, the gateway needs to be modified as well.
另一个缺点是 API 网关又是一项需要开发、维护和运营的服务。此外,它需要能够扩展到其背后的所有服务的任何请求率。也就是说,如果一个应用程序有数十个服务和 API,则好处大于坏处,而且通常是值得的投资。
The other downside is that the API gateway is one more service that needs to be developed, maintained, and operated. Also, it needs to be able to scale to whatever the request rate is for all the services behind it. That said, if an application has dozens of services and APIs, the upside is greater than the downside and it’s generally a worthwhile investment.
那么如何实施网关呢?您可以使用代理框架(例如NGINX )作为起点,推出自己的 API 网关。或者更好的是,您可以使用现成的解决方案,例如Azure API 管理。
So how do you go about implementing a gateway? You can roll your own API gateway, using a proxy framework as a starting point, like NGINX. Or better yet, you can use an off-the-shelf solution, like Azure API Management.
API 网关组合内部 API 的能力非常有限,如果组合需要大量内存中连接,则查询跨服务分布的数据可能会非常低效。
The API gateway’s ability to compose internal APIs is quite limited, and querying data distributed across services can be very inefficient if the composition requires large in-memory joins.
由于与使用微服务架构无关的原因,访问数据也可能效率低下:
Accessing data can also be inefficient for reasons that have nothing to do with using a microservice architecture:
在这些情况下,将读取路径与写入路径解耦可以带来巨大的好处。此方法也称为命令查询职责分离(CQRS) 模式。
In these cases, decoupling the read path from the write path can yield substantial benefits. This approach is also referred to as the Command Query Responsibility Segregation (CQRS) pattern.
这两条路径可以使用适合其特定用例的不同数据模型和数据存储(见图12.5)。例如,读取路径可以使用针对应用程序所需的特定查询模式(例如地理空间或基于图形)定制的专用数据存储。
The two paths can use different data models and data stores that fit their specific use cases (see Figure 12.5). For example, the read path could use a specialized data store tailored to a particular query pattern required by the application, like geospatial or graph-based.
图 12.5:在此示例中,读取和写入路径被分为不同的服务。
Figure 12.5: In this example, the read and write paths are separated out into different services.
为了保持读写数据模型同步,只要数据发生变化,写入路径就会将更新推送到读取路径。外部客户端仍然可以使用写入路径进行简单查询,但复杂查询将路由到读取路径。
To keep the read and write data models synchronized, the write path pushes updates to the read path whenever the data changes. External clients could still use the write path for simple queries, but complex queries are routed to the read path.
这种分离增加了系统的复杂性。例如,当数据模型发生变化时,两条路径可能都需要更新。同样,随着需要维护和操作的移动部件增多,运营成本也会增加。此外,在写入路径上应用更改的时间与读取路径接收并应用更改的时间之间存在固有的复制滞后,这使得系统最终一致。
This separation adds more complexity to the system. For example, when the data model changes, both paths might need to be updated. Similarly, operational costs increase as there are more moving parts to maintain and operate. Also, there is an inherent replication lag between the time a change is applied on the write path and the time the read path receives and applies it, which makes the system eventually consistent.
当应用程序分解为服务时,网络调用的数量会增加,请求的目的地可能会暂时不可用。到目前为止,我们主要假设服务使用直接请求响应通信方式进行通信,这要求目的地可用并及时响应。然而,消息传递——一种间接通信的形式——没有这个要求。
When an application is decomposed into services, the number of network calls increases, and with it, the probability that a request’s destination is momentarily unavailable. So far, we have mostly assumed services communicate using a direct request-response communication style, which requires the destination to be available and respond promptly. Messaging — a form of indirect communication — doesn’t have this requirement, though.
当我们在11.4.1节中讨论异步事务的实现时,首次介绍了消息传递。它是一种间接通信形式,其中生产者将消息写入通道(或消息代理),后者将消息传递给另一端的消费者。
Messaging was first introduced when we discussed the implementation of asynchronous transactions in section 11.4.1. It is a form of indirect communication in which a producer writes a message to a channel — or message broker — that delivers the message to a consumer on the other end.
通过将生产者与消费者解耦,前者可以获得与后者通信的能力,即使后者暂时不可用。消息传递还提供了其他一些好处:
By decoupling the producer from the consumer, the former gains the ability to communicate with the latter even if it’s temporarily unavailable. Messaging provides several other benefits:
由于生产者和消费者之间存在额外的跃点,因此通信延迟必然会更高,如果通道有大量积压的消息等待处理,则更是如此。此外,系统的复杂性随着多了一项服务(即消息代理)需要维护和操作而增加——一如既往,这一切都与权衡有关。
Because there is an additional hop between the producer and consumer, the communication latency is necessarily going to be higher, more so if the channel has a large backlog of messages waiting to be processed. Additionally, the system’s complexity increases as there is one more service, the message broker, that needs to be maintained and operated — as always, it’s all about tradeoffs.
任意数量的生产者都可以将消息写入通道,同样,多个消费者可以从中读取消息。根据通道向消费者传递消息的方式,可以将其分类为点对点通道或发布-订阅通道。在点对点通道中,特定消息被传递给恰好一个消费者。相反,在发布-订阅通道中,同一消息的副本将传递给所有消费者。
Any number of producers can write messages to a channel, and similarly, multiple consumers can read from it. Depending on how the channel delivers messages to consumers, it can be classified as either point-to-point or publish-subscribe. In a point-to-point channel, a specific message is delivered to exactly one consumer. Instead, in a publish-subscribe channel, a copy of the same message is delivered to all consumers.
消息通道可用于多种不同的通信方式。
A message channel can be used for a variety of different communication styles.
单向消息传递
One-way messaging
在这种消息传递风格中,生产者将消息写入点对点通道,并期望消费者最终会读取并处理该消息(见图12.6)。
In this messaging style, the producer writes a message to a point-to-point channel with the expectation that a consumer will eventually read and process it (see Figure 12.6).
图 12.6:单向消息传递风格
Figure 12.6: One-way messaging style
请求-响应消息传递
Request-response messaging
这种消息传递风格类似于我们熟悉的直接请求-响应风格,尽管不同之处在于请求和响应消息通过通道流动。消费者有一个点对点的请求通道,可以从中读取消息,而每个生产者都有自己专用的响应通道(见图12.7)。
This messaging style is similar to the direct request-response style we are familiar with, albeit with the difference that the request and response messages flow through channels. The consumer has a point-to-point request channel from which it reads messages, while every producer has its own dedicated response channel (see Figure 12.7).
当生产者将消息写入请求通道时,它会使用请求 id 和对其响应通道的引用来装饰该消息。消费者读取并处理消息后,会将回复写入生产者的响应通道,并使用请求的 id 对其进行标记,这使得生产者可以识别其所属的请求。
When a producer writes a message to the request channel, it decorates it with a request id and a reference to its response channel. After a consumer has read and processed the message, it writes a reply to the producer’s response channel, tagging it with the request’s id, which allows the producer to identify the request it belongs to.
图 12.7:请求-响应消息传递风格
Figure 12.7: Request-response messaging style
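The correlation mechanism described above can be sketched with in-process queues standing in for real channels; the field names ("id", "reply_to") are illustrative:

```python
import queue

# Request-response messaging sketched with in-process queues standing in
# for broker channels.
request_channel = queue.Queue()   # shared point-to-point request channel
response_channel = queue.Queue()  # this producer's dedicated response channel

# Producer: decorate the message with a request id and a reply-to reference.
request_channel.put({"id": 42, "reply_to": response_channel, "body": "ping"})

# Consumer: process the message, then tag the reply with the request's id.
message = request_channel.get()
message["reply_to"].put({"id": message["id"], "body": message["body"].upper()})

# Producer: correlate the reply with the outstanding request via the id.
reply = response_channel.get()
```

With a real broker, the "reply_to" reference would be the name of the producer's response channel rather than an object reference.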
广播消息
Broadcast messaging
在这种消息传递风格中,生产者将消息写入发布-订阅通道,以将其广播给所有消费者(见图12.8)。该机制通常用于通知一组进程发生了特定事件。我们在第 11.4.1节讨论基于日志的事务时已经遇到过这种模式。
In this messaging style, a producer writes a message to a publish-subscribe channel to broadcast it to all consumers (see Figure 12.8). This mechanism is generally used to notify a group of processes that a specific event has occurred. We have already encountered this pattern when discussing log-based transactions in section 11.4.1.
图 12.8:广播消息传递风格
Figure 12.8: Broadcast messaging style
消息通道由消息服务实现,例如AWS SQS或Kafka。消息传递服务或代理充当消息的缓冲区。它将生产者与消费者解耦,这样他们就不需要知道消费者的地址、有多少个消费者,或者它们是否可用。
A message channel is implemented by a messaging service, like AWS SQS or Kafka. The messaging service, or broker, acts as a buffer for messages. It decouples producers from consumers so that they don’t need to know the consumers’ addresses, how many of them there are, or whether they are available.
不同的消息代理根据其提供的权衡和保证以不同的方式实现通道抽象。例如,您可能认为通道应该尊重其消息的插入顺序,但您会发现某些实现(例如SQS 标准队列)不提供任何强大的排序保证。这是为什么?
Different message brokers implement the channel abstraction differently depending on the tradeoffs and the guarantees they offer. For example, you would think that a channel should respect the insertion order of its messages, but you will find that some implementations, like SQS standard queues, don’t offer any strong ordering guarantees. Why is that?
由于消息代理需要像使用它的应用程序一样进行扩展,因此它的实现必然是分布式的。当涉及多个节点时,保证顺序变得具有挑战性,因为需要某种形式的协调。一些代理(例如 Kafka)将一个通道划分为多个子通道,每个子通道都足够小,可以完全由单个进程处理。这个想法是,如果有一个代理进程负责子通道的消息,那么保证它们的顺序应该很简单。
Because a message broker needs to scale out just like the applications that use it, its implementation is necessarily distributed. And when multiple nodes are involved, guaranteeing order becomes challenging as some form of coordination is required. Some brokers, like Kafka, partition a channel into multiple sub-channels, each small enough to be handled entirely by a single process. The idea is that if there is a single broker process responsible for the messages of a sub-channel, then it should be trivial to guarantee their order.
在这种情况下,当消息发送到通道时,它们会根据分区键分为子通道。为了保证端到端地保留消息顺序,只能允许单个消费者进程从子通道2中读取。
In this case, when messages are sent to the channel, they are partitioned into sub-channels based on a partition key. To guarantee that the message order is preserved end-to-end, only a single consumer process can be allowed to read from a sub-channel2.
由于通道是分区的,因此它有几个缺点。例如,特定分区可能会比其他分区更热,并且从中读取数据的单个使用者可能无法跟上负载。在这种情况下,通道需要重新分区,这可能会暂时降低代理的性能,因为消息需要在所有分区之间重新洗牌。在本章后面,我们将更多地了解分区的优点和缺点。
Because the channel is partitioned, it suffers from several drawbacks. For example, a specific partition can become much hotter than the others, and the single consumer reading from it might not be able to keep up with the load. In that case, the channel needs to be repartitioned, which can temporarily degrade the broker’s performance, since messages need to be reshuffled across all partitions. Later in the chapter, we will learn more about the pros and cons of partitioning.
现在您知道为什么不必保证消息的顺序可以使代理的实现变得更加简单。订购只是经纪商需要做出的众多权衡之一,例如:
Now you see why not having to guarantee the order of messages makes the implementation of a broker much simpler. Ordering is just one of the many tradeoffs a broker needs to make, such as:
由于实现通道的方法有很多不同,因此为了简单起见,在本节的其余部分中我们将做出一些假设:
Because there are so many different ways to implement channels, in the rest of this section we will make some assumptions for the sake of simplicity:
上述保证与Amazon 的 SQS和Azure 存储队列等云服务提供的保证非常相似。
The above guarantees are very similar to what cloud services such as Amazon’s SQS and Azure Storage Queues offer.
如前所述,消费者在处理完消息后必须从通道中删除消息,以免其他消费者读取该消息。
As mentioned, a consumer has to delete a message from the channel once it’s done processing it so that it won’t be read by another consumer.
如果消费者在处理消息之前删除消息,则存在消费者在删除之后、处理之前崩溃的风险,导致消息永久丢失。另一方面,如果消费者在处理消息之后才删除消息,则存在消费者在处理之后、删除之前崩溃的风险,导致稍后再次读取相同的消息。
If the consumer deletes the message before processing it, there is a risk it could crash after deleting the message and before processing it, causing the message to be lost for good. On the other hand, if the consumer deletes the message only after processing it, there is a risk that the consumer might crash after processing the message but before deleting it, causing the same message to be read again later on.
因此,不存在“一次性消息传递”这样的事情。消费者能做的最好的事情就是通过要求消息是幂等的来模拟一次消息处理。
Because of that, there is no such thing as exactly-once message delivery. The best a consumer can do is to simulate exactly-once message processing by requiring messages to be idempotent.
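For example, a consumer can be made idempotent by remembering the IDs of messages it has already processed; in practice this set would have to live in a durable store updated atomically with the processing itself, but an in-memory sketch shows the idea:

```python
# Simulating exactly-once *processing* on top of at-least-once delivery:
# the handler is made idempotent by remembering processed message ids.
processed_ids = set()   # in production this would live in a durable store
balance = {"ada": 0}

def handle(message):
    if message["id"] in processed_ids:   # duplicate delivery: skip it
        return
    balance[message["account"]] += message["amount"]
    processed_ids.add(message["id"])

deposit = {"id": "m-1", "account": "ada", "amount": 10}
handle(deposit)
handle(deposit)   # redelivered after a consumer crash; has no extra effect
```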
当某个消费者处理消息失败时,就会触发可见性超时,消息最终会被传递给另一个消费者。但是,如果处理特定消息始终失败并出现错误,会发生什么情况?为了防止消息被永久重复拾取,我们需要限制从通道读取同一消息的最大次数。
When a consumer fails to process a message, the visibility timeout triggers, and the message is eventually delivered to another consumer. What happens if processing a specific message consistently fails with an error, though? To guard against the message being picked up repeatedly in perpetuity, we need to limit the maximum number of times the same message can be read from the channel.
为了强制执行最大重试次数,代理可以使用计数器来标记消息,该计数器跟踪消息已传递给消费者的次数。如果代理不支持开箱即用的此功能,则可以由消费者实现。
To enforce a maximum number of retries, the broker can stamp messages with a counter that keeps track of the number of times the message has been delivered to a consumer. If the broker doesn’t support this functionality out of the box, it can be implemented by the consumers.
一旦您有办法计算消息重试的次数,您仍然必须决定在达到最大值时该怎么做。消费者不应在未处理消息的情况下删除消息,因为这会导致数据丢失。但它可以做的是将消息写入死信通道后从通道中删除消息——死信通道充当已重试次数过多的消息的缓冲区。
Once you have a way to count the number of times a message has been retried, you still have to decide what to do when the maximum is reached. A consumer shouldn’t delete a message without processing it, as that would cause data loss. But what it can do is remove the message from the channel after writing it to a dead letter channel — a channel that acts as a buffer for messages that have been retried too many times.
这样,持续失败的消息不会永远丢失,而只是放在一边,这样它们就不会污染主通道,浪费消费者的处理资源。然后,人们可以检查这些消息来调试故障,一旦确定并修复根本原因,将它们移回主通道进行重新处理。
This way, messages that consistently fail are not lost forever but merely put on the side so that they don’t pollute the main channel, wasting consumers’ processing resources. A human can then inspect these messages to debug the failure, and once the root cause has been identified and fixed, move them back to the main channel to be reprocessed.
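A minimal sketch of the retry-counting logic, using in-memory deques in place of real channels; the retry limit and message format are arbitrary:

```python
import collections

MAX_RETRIES = 3
main_channel = collections.deque()
dead_letter_channel = collections.deque()

def fail_always(message):
    raise RuntimeError("poisonous message")  # processing never succeeds

def consume(handler):
    message = main_channel.popleft()
    try:
        handler(message)                      # delete only after processing
    except Exception:
        message["deliveries"] += 1            # counter stamped on the message
        if message["deliveries"] >= MAX_RETRIES:
            dead_letter_channel.append(message)  # park it for inspection
        else:
            main_channel.append(message)      # make it visible again

main_channel.append({"id": "m-1", "deliveries": 0})
while main_channel:
    consume(fail_always)
```

After three failed deliveries, the message ends up in the dead letter channel instead of circulating forever.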
使用消息代理的主要优点之一是它使系统对于中断更加稳健。即使一个或多个消费者不可用或性能下降,生产者也可以继续向通道写入消息。只要消息的到达速率低于或等于它们从通道中删除的速率,一切都很好。当情况不再如此,并且消费者无法跟上生产者的步伐时,积压的订单就会开始增加。
One of the main advantages of using a message broker is that it makes the system more robust to outages. Producers can continue to write messages to a channel even if one or more consumers are unavailable or degraded. As long as the rate at which messages arrive is lower than or equal to the rate at which they are deleted from the channel, everything is great. When that is no longer true, and consumers can’t keep up with producers, a backlog starts to build up.
消息传递通道在系统中引入了双模式行为。在一种模式下,没有积压,一切都按预期进行。另一种情况是,积压的工作量不断增加,系统进入降级状态。积压的问题在于,积压的时间越长,耗尽它所需的资源和/或时间就越多。
A messaging channel introduces a bi-modal behavior in the system. In one mode, there is no backlog, and everything works as expected. In the other, a backlog builds up, and the system enters a degraded state. The issue with a backlog is that the longer it builds up, the more resources and/or time it will take to drain it.
造成积压的原因有多种,例如:
There are several reasons for backlogs, for example:
要检测积压,您应该测量消息在通道中等待首次读取的平均时间。通常,代理会附加消息首次写入的时间戳。消费者可以使用该时间戳与读取消息时的时间戳进行比较,来计算消息在通道中等待的时间。尽管这两个时间戳是由两个不完全同步的物理时钟生成的(请参阅第8.1节),但该度量仍然可以很好地指示积压情况。
To detect backlogs, you should measure the average time a message waits in the channel to be read for the first time. Typically, brokers attach a timestamp of when the message was first written to it. The consumer can use that timestamp to compute how long the message has been waiting in the channel by comparing it to the timestamp taken when the message was read. Although the two timestamps have been generated by two physical clocks that aren’t perfectly synchronized (see section 8.1), the measure still provides a good indication of the backlog.
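A sketch of that measurement, assuming the broker stamps each message with an enqueue timestamp:

```python
import time

def message_wait_time(message):
    """Time the message spent in the channel before its first read."""
    return time.time() - message["enqueued_at"]

# Simulate a message the broker stamped 5 seconds ago.
msg = {"body": "hello", "enqueued_at": time.time() - 5.0}
wait = message_wait_time(msg)
```

An alerting system would average this value over recent messages and fire when it trends upward.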
发出反复无法处理的“有毒”消息的特定生产者可能会降低整个系统的性能,并可能导致积压,因为消息在进入死信通道之前会被多次处理。因此,在有问题的生产者开始影响系统的其余部分之前找到处理它们的方法非常重要3。
A specific producer that emits “poisonous” messages that repeatedly fail to be processed can degrade the whole system and potentially cause backlogs, since messages are processed multiple times before they end up in the dead-letter channel. Therefore, it’s important to find ways to deal with problematic producers before they start to affect the rest of the system3.
如果消息属于不同的用户4并用某种标识符修饰,消费者可以决定以不同的方式对待“吵闹”的用户。例如,假设来自特定用户的消息始终失败。在这种情况下,消费者可以决定将这些消息写入备用低优先级通道,并将它们从主通道中删除而不处理它们。消费者可以继续从慢速通道读取,但频率较低。这确保了一个坏用户不会影响其他用户。
If messages belong to different users4 and are decorated with some kind of identifier, consumers can decide to treat “noisy” users differently. For example, suppose messages from a specific user fail consistently. In that case, the consumers could decide to write these messages to an alternate low-priority channel and remove them from the main channel without processing them. The consumers can continue to read from the slow channel, but do so less frequently. This ensures that one single bad user can’t affect others.
传输图像、音频文件或视频等大型二进制对象 (blob) 可能具有挑战性,甚至根本不可能,具体取决于介质。例如,消息代理限制可以写入通道的消息的最大大小;Azure 存储队列将消息限制为 64 KB,AWS Kinesis 限制为 1 MB 等。那么,如何在这些严格的限制下传输数百 MB 的大型 Blob?
Transmitting a large binary object (blob) like images, audio files, or video can be challenging or simply impossible, depending on the medium. For example, message brokers limit the maximum size of messages that can be written to a channel; Azure Storage queues limit messages to 64 KB, AWS Kinesis to 1 MB, etc. So how do you transfer large blobs of hundreds of MBs with these strict limits?
您可以将 Blob 上传到对象存储服务(例如 AWS S3 或 Azure Blob Storage),然后通过消息发送 Blob 的 URL(此模式有时称为“队列加 Blob”)。缺点是现在您必须处理两个服务:消息代理和对象存储,而不仅仅是消息代理,这增加了系统的复杂性。
You can upload a blob to an object storage service, like AWS S3 or Azure Blob Storage, and then send the URL of the blob via message (this pattern is sometimes referred to as queue plus blob). The downside is that now you have to deal with two services, the message broker and the object store, rather than just the message broker, which increases the system’s complexity.
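A sketch of the queue-plus-blob pattern, with in-memory dictionaries and lists standing in for S3/SQS-like services:

```python
import uuid

# The blob goes to an object store; only a small reference travels through
# the message channel. The stores below are stand-ins for real services.
object_store = {}
channel = []

def send_large_message(blob: bytes):
    blob_ref = f"blobs/{uuid.uuid4()}"
    object_store[blob_ref] = blob           # 1. upload the blob
    channel.append({"blob_ref": blob_ref})  # 2. send only the reference

def receive_large_message():
    message = channel.pop(0)
    return object_store[message["blob_ref"]]  # 3. fetch the blob by reference

send_large_message(b"\x00" * 1_000_000)  # far above a 64 KB message limit
payload = receive_large_message()
```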
类似的方法可用于在数据库中存储大型 blob,而不是将 blob 直接存储在数据库中,您只需存储一些包含对实际 blob 的外部引用的元数据。该解决方案的优点在于,它最大限度地减少了数据存储中来回传输的数据,提高了性能,同时减少了所需的带宽。此外,设计用于持久保存不经常更改(如果有的话)的大型对象的对象存储的每字节成本低于通用数据存储的成本。
A similar approach can be used to store large blobs referenced from a database: rather than storing a blob in the database directly, you store only some metadata containing an external reference to the actual blob. The advantage of this solution is that it minimizes the data transferred to and from the data store, improving its performance while reducing the required bandwidth. Also, the cost per byte of an object store designed to persist large objects that change infrequently, if at all, is lower than that of a generic data store.
当然,缺点是您无法使用 Blob 的元数据以及数据存储中可能的其他记录以事务方式更新 Blob。例如,假设事务在包含图像的数据存储中插入一条新记录。在这种情况下,在交易完成之前图像将不可见;但是,如果图像存储在外部存储中,情况就不会如此。同样,如果以后删除该记录,该图像也会自动删除;但如果该图片位于商店之外,则您有责任将其删除。
Of course, the downside is that you lose the ability to transactionally update the blob with its metadata and potentially other records in the data store. For example, suppose a transaction inserts a new record in the data store containing an image. In this case, the image won’t be visible until the transaction completes; that won’t be the case if the image is stored in an external store, though. Similarly, if the record is later deleted, the image is automatically deleted as well; but if the image lives outside the store, it’s your responsibility to delete it.
将 blob 存储在数据存储之外是否可接受取决于您的具体用例。
Whether storing blobs outside of your data store is acceptable or not depends on your specific use cases.
例如,加密强度强的随机数↩︎
e.g., a cryptographically-strong random number↩︎
这也称为竞争消费者模式,它是通过领导者选举来实现的↩︎
This is also referred to as the competing consumer pattern, which is implemented using leader election↩︎
这些生产者也被称为吵闹的邻居↩︎
These producers are also referred to as noisy neighbors↩︎
用户可以是人类或应用程序。↩︎
A user can be a human or an application.↩︎
现在是时候换档并深入使用您可以使用的另一个工具来扩展应用程序 - 分区或分片。
Now it’s time to change gears and dive into another tool you have at your disposal to scale out applications — partitioning, or sharding.
当数据集不再适合单个节点时,需要将其划分到多个节点上。分区是一种通用技术,可用于多种情况,例如在负载均衡器的后端之间对 TCP 连接进行分片。为了奠定本章讨论的基础,我们将其锚定在分片键值存储的实现上。
When a dataset no longer fits on a single node, it needs to be partitioned across multiple nodes. Partitioning is a general technique that can be used in a variety of circumstances, like sharding TCP connections across backends in a load balancer. To ground the discussion in this chapter, we will anchor it to the implementation of a sharded key-value store.
当客户端向分区数据存储发送读取或写入密钥的请求时,需要将该请求路由到负责该密钥所属分区的节点。一种方法是使用网关服务,该服务可以将请求路由到正确的位置,了解键如何映射到分区以及分区到节点。
When a client sends a request to a partitioned data store to read or write a key, the request needs to be routed to the node responsible for the partition the key belongs to. One way to do that is to use a gateway service that can route the request to the right place knowing how keys are mapped to partitions and partitions to nodes.
键和分区以及其他元数据之间的映射通常在高度一致的配置存储中维护,例如 etcd 或 Zookeeper。但是键首先是如何映射到分区的呢?在较高层次上,有两种方法可以使用范围分区或散列分区来实现映射。
The mapping between keys and partitions, and other metadata, is typically maintained in a strongly-consistent configuration store, like etcd or Zookeeper. But how are keys mapped to partitions in the first place? At a high level, there are two ways to implement the mapping using either range partitioning or hash partitioning.
使用范围分区,数据按字典顺序按键范围划分为多个分区,每个分区保存一个连续范围的键,如图13.1所示。数据可以按排序顺序存储在磁盘上的每个分区中,从而加快范围扫描速度。
With range partitioning, the data is split into partitions by key range in lexicographical order, and each partition holds a continuous range of keys, as shown in Figure 13.1. The data can be stored in sorted order on disk within each partition, making range scans fast.
图 13.1:范围分区数据集
Figure 13.1: A range partitioned dataset
如果键的分布不均匀(就像在英语词典中一样),那么均匀地分割键范围就没有多大意义。这样做会创建不平衡的分区,其中包含的条目比其他分区多得多。
Splitting the key-range evenly doesn’t make much sense though if the distribution of keys is not uniform, like in the English dictionary. Doing so creates unbalanced partitions that contain significantly more entries than others.
范围分区的另一个问题是某些访问模式可能会导致热点。例如,如果数据集按日期进行范围分区,则当天的所有写入最终都会在同一分区中,这会降低数据存储的性能。
Another issue with range partitioning is that some access patterns can lead to hotspots. For example, if a dataset is range partitioned by date, all writes for the current day end up in the same partition, which degrades the data store’s performance.
哈希分区背后的想法是使用哈希函数将键分配给分区,从而在分区之间重新排列(或均匀分布)键,如图 13.2所示。另一种思考方式是,哈希函数将潜在的非均匀分布的密钥空间映射到均匀分布的哈希空间。
The idea behind hash partitioning is to use a hash function to assign keys to partitions, which shuffles — or uniformly distributes — keys across partitions, as shown in Figure 13.2. Another way to think about it is that the hash function maps a potentially non-uniformly distributed key space to a uniformly distributed hash space.
例如,哈希分区的简单版本可以使用模块化哈希来实现,即 hash(key) mod N。
For example, a simple version of hash partitioning can be implemented with modular hashing, i.e., hash(key) mod N.
图 13.2:哈希分区数据集
Figure 13.2: A hash partitioned dataset
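A sketch of modular hashing; a deterministic hash such as SHA-256 is used because Python's built-in hash() for strings is randomized per process and would break the mapping across restarts:

```python
import hashlib

N = 4  # number of partitions

def partition(key: str) -> int:
    """Map a key to one of N partitions with modular hashing."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N

p = partition("user-123")
same = partition("user-123")  # the same key always maps to the same partition
```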
尽管此方法可确保分区包含或多或少相同数量的条目,但如果访问模式不统一,则它无法消除热点。如果有一个键的访问频率显着高于其他键,那么所有的赌注都会被取消。在这种情况下,包含热键的分区需要进一步向下分割。或者,需要将密钥拆分为多个子密钥,例如通过在其末尾添加偏移量。
Although this approach ensures that the partitions contain more or less the same number of entries, it doesn’t eliminate hotspots if the access pattern is not uniform. If there is a single key that is accessed significantly more often than others, then all bets are off. In this case, the partition that contains the hot key needs to be split further down. Alternatively, the key needs to be split into multiple sub-keys, for example, by adding an offset at the end of it.
当添加新分区时,使用模块化哈希可能会出现问题,因为所有键都必须跨分区重新洗牌。混洗数据的成本极其昂贵,因为它会消耗网络带宽以及托管分区的节点的其他资源。理想情况下,如果添加一个分区,则只应移动 K/N 个键,其中 K 是键的数量,N 是分区的数量。保证此属性的哈希策略称为稳定哈希。
Using modular hashing can become problematic when a new partition is added, as all keys have to be reshuffled across partitions. Shuffling data is extremely expensive as it consumes network bandwidth and other resources from the nodes hosting the partitions. Ideally, if a partition is added, only K/N keys should be shuffled around, where K is the number of keys and N the number of partitions. A hashing strategy that guarantees this property is called stable hashing.
环哈希是稳定哈希的一个例子。通过环散列,函数将键映射到圆上的点。然后,根据具体算法,将圆分成可以均匀或伪随机间隔的分区。当添加新分区时,可以证明大多数键不需要进行洗牌。
Ring hashing is an example of stable hashing. With ring hashing, a function maps a key to a point on a circle. The circle is then split into partitions that can be evenly or pseudo-randomly spaced, depending on the specific algorithm. When a new partition is added, it can be shown that most keys don’t need to be shuffled around.
例如,使用一致性哈希,分区标识符和键都随机分布在一个圆上,每个键都分配给按顺时针顺序出现在圆上的下一个分区(见图13.3 )。
For example, with consistent hashing, both the partition identifiers and keys are randomly distributed on a circle, and each key is assigned to the next partition that appears on the circle in clockwise order (see Figure 13.3).
图 13.3:通过一致性哈希,分区标识符和键随机分布在一个圆上,每个键都分配给按顺时针顺序出现在圆上的下一个分区。
Figure 13.3: With consistent hashing, partition identifiers and keys are randomly distributed on a circle, and each key is assigned to the next partition that appears on the circle in clockwise order.
现在,当添加新分区时,只需重新分配映射到它的键,如图13.4所示。
Now, when a new partition is added, only the keys mapped to it need to be reassigned, as shown in Figure 13.4.
图13.4:添加分区P4后,键“for”被重新分配给P4,但其他键没有重新分配。
Figure 13.4: After partition P4 is added, key ‘for’ is reassigned to P4, but the other keys are not reassigned.
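A minimal consistent-hashing ring, with a single point per partition for simplicity (real implementations typically place many virtual points per partition to even out the arcs); it also counts how many keys move when P4 is added:

```python
import bisect
import hashlib

def point(value: str) -> int:
    """Map a string to a point on the circle [0, 2^32)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, partitions):
        self.points = sorted((point(p), p) for p in partitions)
        self._keys = [pt for pt, _ in self.points]

    def lookup(self, key: str) -> str:
        # Each key is assigned to the next partition clockwise on the circle.
        i = bisect.bisect_right(self._keys, point(key)) % len(self.points)
        return self.points[i][1]

before = Ring(["P1", "P2", "P3"])
after = Ring(["P1", "P2", "P3", "P4"])
keys = [f"key-{i}" for i in range(1000)]
moved = sum(1 for k in keys if before.lookup(k) != after.lookup(k))
```

Only the keys that now map to P4 get reassigned; every other key keeps its partition, which is exactly the stability property modular hashing lacks.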
与范围分区相比,散列分区的主要缺点是分区上的排序顺序丢失。但是,单个分区内的数据仍然可以根据辅助键进行排序。
The main drawback of hash partitioning compared to range partitioning is that the sort order over the partitions is lost. However, the data within an individual partition can still be sorted based on a secondary key.
当对数据存储的请求数量变得太大,或者数据集的大小变得太大时,需要增加为分区提供服务的节点数量。同样,如果数据集的大小不断缩小,则可以减少节点数量以降低成本。添加和删除节点以平衡系统负载的过程称为重新平衡。
When the number of requests to the data store becomes too large, or the dataset’s size becomes too large, the number of nodes serving partitions needs to be increased. Similarly, if the dataset’s size keeps shrinking, the number of nodes can be decreased to reduce costs. The process of adding and removing nodes to balance the system’s load is called rebalancing.
重新平衡需要以最小化对数据存储的干扰的方式实施,数据存储需要继续服务请求。因此,需要最大限度地减少重新平衡过程中传输的数据量。
Rebalancing needs to be implemented in such a way as to minimize disruption to the data store, which needs to continue serving requests. Hence, the amount of data transferred during rebalancing needs to be minimized.
在这里,我们的想法是在首次初始化数据存储时创建比所需更多的分区,并为每个节点分配多个分区。当有新节点加入时,部分分区会从现有节点移动到新节点,从而使存储始终处于平衡状态。
Here, the idea is to create way more partitions than necessary when the data store is first initialized and assign multiple partitions per node. When a new node joins, some partitions move from the existing nodes to the new one so that the store is always in a balanced state.
The drawback of this approach is that the number of partitions is set when the data store is first initialized and can’t be easily changed after that. Getting the number of partitions wrong can be problematic — too many partitions add overhead and decrease the data store’s performance, while too few partitions limit the data store’s scalability.
An alternative to creating partitions upfront is to create them on-demand. One way to implement dynamic partitioning is to start with a single partition. When it grows above a certain size or becomes too hot, it’s split into two sub-partitions, each containing approximately half of the data. Then, one sub-partition is transferred to a new node. Similarly, if two adjacent partitions become small enough, they can be merged into a single one.
Introducing partitions in the system adds a fair amount of complexity, even if it appears deceptively simple. Partition imbalance can easily become a headache as a single hot partition can bottleneck the system and limit its ability to scale. And as each partition is independent of the others, transactions are required to update multiple partitions atomically.
We have merely scratched the surface of the topic; if you are interested in learning more about it, I recommend reading Designing Data-Intensive Applications by Martin Kleppmann.
Now it’s time to change gears and dive into another tool you have at your disposal to design horizontally scalable applications — duplication.
Arguably the easiest way to add more capacity to a service is to create more instances of it and have some way of routing, or balancing, requests to them. The thinking is that if one instance has a certain capacity, then 2 instances should have a capacity that is twice that.
Creating more service instances can be a fast and cheap way to scale out a stateless service, as long as you have taken into account the impact on its dependencies. For example, if every service instance needs to access a shared data store, eventually, the data store will become a bottleneck, and adding more service instances to the system will only strain it further.
The routing, or balancing, of requests across a pool of servers is implemented by a network load balancer. A load balancer (LB) has one or more physical network interface cards (NIC) mapped to one or more virtual IP (VIP) addresses. A VIP, in turn, is associated with a pool of servers. The LB acts as a middle-man between clients and servers — the clients only see the VIP exposed by the LB and have no visibility of the individual servers associated with it.
Distributing requests across servers has many benefits. Because clients are decoupled from servers and don’t need to know their individual addresses, the number of servers behind the LB can be increased or reduced transparently. And since multiple redundant servers can interchangeably be used to handle requests, a LB can detect faulty ones and take them out of the pool, increasing the service’s availability.
At a high level, a LB supports several core features beyond load balancing, like service discovery and health-checks.
Load Balancing
The algorithms used for routing requests can vary from simple round-robin to more complex ones that take into account the servers’ load and health. There are several ways for a LB to infer the load of the servers. For example, the LB could periodically hit a dedicated load endpoint of each server that returns a measure of how busy the server is (e.g., CPU usage). Hitting the servers constantly can be very costly though, so typically a LB caches these measures for some time.
Using cached, and hence delayed, metrics to distribute requests to servers can create a herding effect. Suppose the load metrics are refreshed periodically, and a server that just joined the pool reported a load of 0 — guess what happens next? The LB is going to hammer that server until the next time its load is sampled. When that happens, the server is marked as busy, and the LB stops sending more requests to it, assuming it hasn’t become unavailable first due to the volume of requests sent its way. This creates a ping-pong effect where the server alternates between being very busy and not busy at all.
Because of this herding effect, it turns out that randomly distributing requests to servers without accounting for their load actually achieves a better load distribution. Does that mean that load balancing using delayed load metrics is not possible?
Actually, there is a way, but it requires combining load metrics with the power of randomness. The idea is to randomly pick two servers from the pool and route the request to the least-loaded one of the two. This approach works remarkably well as it combines delayed load information with the protection against herding that randomness provides.
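This "power of two choices" policy can be sketched in a few lines. The server names and load values below are made up for illustration; a real LB would use its cached load measurements.

```python
import random

def pick_server(servers, load):
    # Sample two distinct servers at random and route the request
    # to the less loaded of the two.
    a, b = random.sample(servers, 2)
    return a if load[a] <= load[b] else b

# Stale load metrics can't cause herding here: a seemingly idle
# server only receives a request when it happens to be one of the
# two sampled candidates.
servers = ["s1", "s2", "s3", "s4"]
load = {"s1": 10, "s2": 3, "s3": 7, "s4": 0}  # s4 just joined the pool
counts = {s: 0 for s in servers}
for _ in range(10_000):
    counts[pick_server(servers, load)] += 1
```

Even though `s4` reports a load of 0, it receives only roughly half of the traffic rather than all of it, because it must first be one of the two sampled servers.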
Service Discovery
Service discovery is the mechanism used by the LB to discover the available servers in the pool it can route requests to. There are various ways to implement it. For example, a simple approach is to use a static configuration file that lists the IP addresses of all the servers. However, this is quite painful to manage and keep up-to-date. A more flexible solution can be implemented with DNS. Finally, using a data store provides the maximum flexibility at the cost of increasing the system’s complexity.
One of the benefits of using a dynamic service discovery mechanism is that servers can be added and removed from the LB’s pool at any time. This is a crucial functionality that cloud providers leverage to implement autoscaling, i.e., the ability to automatically spin up and tear down servers based on their load.
Health Checks
Health checks are used by the LB to detect when a server can no longer serve requests and needs to be temporarily removed from the pool. There are fundamentally two categories of health checks: passive and active.
A passive health check is performed by the LB as it routes incoming requests to the servers downstream. If a server isn’t reachable, the request times out, or the server returns a non-retriable status code (e.g., 503), the LB can decide to take that server out from the pool.
Instead, an active health check requires support from the downstream servers, which need to expose a health endpoint signaling the server’s health state. Later in the book, we will describe in greater detail how to implement such a health endpoint.
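As an illustration, here is a toy sketch of an LB's active health-checking loop. The `probe` callback stands in for a real request to each server's health endpoint, and the consecutive-failure threshold is an assumption of the example, not a prescribed value.

```python
class HealthChecker:
    def __init__(self, servers, probe, failure_threshold=3):
        self.servers = set(servers)
        self.pool = set(servers)   # servers currently eligible for traffic
        self.probe = probe         # returns True if the server looks healthy
        self.failures = {s: 0 for s in servers}
        self.threshold = failure_threshold

    def tick(self):
        # One round of active checks: a server leaves the pool only after
        # a few consecutive failed probes (to tolerate transient blips),
        # and rejoins as soon as a probe succeeds again.
        for s in self.servers:
            if self.probe(s):
                self.failures[s] = 0
                self.pool.add(s)
            else:
                self.failures[s] += 1
                if self.failures[s] >= self.threshold:
                    self.pool.discard(s)

# Simulated probes: "b" is failing while "a" stays healthy.
status = {"a": True, "b": False}
checker = HealthChecker(["a", "b"], probe=lambda s: status[s])
for _ in range(3):
    checker.tick()          # after 3 failed probes, "b" leaves the pool
degraded_pool = checker.pool.copy()
status["b"] = True
checker.tick()              # a successful probe brings "b" back
```

Requiring several consecutive failures before evicting a server is a common way to avoid flapping on a single dropped probe.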
Now that we know what a load balancer’s job is, let’s take a closer look at how it can be implemented. While you probably won’t have to build your own LB given the plethora of off-the-shelf solutions available, a basic knowledge of how load balancing works is crucial. LB failures are very visible to your services’ clients since they tend to manifest themselves as timeouts and connection resets. Because the LB sits between your service and its clients, it also contributes to the end-to-end latency of request-response transactions.
The most basic form of load balancing can be implemented with DNS. Suppose you have a couple of servers that you would like to load balance requests over. If these servers have publicly-reachable IP addresses, you can add those to the service’s DNS record and have the clients pick one when resolving the DNS address, as shown in Figure 14.1.
Figure 14.1: DNS load balancing
Although this works, it doesn’t deal well with failures. If one of the two servers goes down, the DNS server will happily continue serving its IP address unaware of the failure. You can manually reconfigure the DNS record to take out the problematic IP, but as we have learned in chapter 4, changes are not applied immediately due to the nature of DNS caching.
A more flexible load balancing solution can be implemented with a load balancer that operates at the TCP level of the network stack, through which all traffic between clients and servers needs to go.
When a client creates a new TCP connection with a LB’s VIP, the LB picks a server from the pool and henceforth shuffles the packets back and forth for that connection between the client and the server. How does the LB assign connections to the servers, though?
A connection is identified by a tuple (source IP/port, destination IP/port). Typically, some form of hashing is used to assign a connection tuple to a server. To minimize the disruption caused by a server being added or removed from the pool, consistent hashing is preferred over modular hashing.
To forward packets downstream, the LB translates each packet’s source address to the LB address and its destination address to the server’s address. Similarly, when the LB receives a packet from the server, it translates its source address to the LB address and its destination address to the client’s address (see Figure 14.2).
Figure 14.2: Transport layer load balancing
As the data going out of the servers usually has a greater volume than the data coming in, there is a way for servers to bypass the LB and respond directly to the clients using a mechanism called direct server return, but this is beyond the scope of this section.
Because the LB communicates directly with the servers, it can detect unavailable ones (e.g., with a passive health check) and automatically take them out of the pool, improving the reliability of the backend service.
Although load balancing connections at the TCP level is very fast, the drawback is that the LB is just shuffling bytes around without knowing what they actually mean. Therefore, L4 LBs generally don’t support features that require higher-level network protocols, like terminating TLS connections or balancing HTTP sessions based on cookies. A load balancer that operates at a higher level of the network stack is required to support these advanced use cases.
An application layer load balancer is an HTTP reverse proxy that farms out requests over a pool of servers. The LB receives an HTTP request from a client, inspects it, and sends it to a backend server.
There are two different TCP connections at play here, one between the client and the L7 LB and another between the L7 LB and the server. Because a L7 LB operates at the HTTP level, it can de-multiplex individual HTTP requests sharing the same TCP connection. This is even more important with HTTP 2, where multiple concurrent streams are multiplexed on the same TCP connection, and some connections can be several orders of magnitude more expensive to handle than others.
The LB can do smart things with application traffic, like rate-limiting requests based on HTTP headers, terminating TLS connections, or forcing HTTP requests belonging to the same logical session to be routed to the same backend server.
For example, the LB could use a specific cookie to identify which logical session a specific request belongs to. Just like with a L4 LB, the session identifier can be mapped to a server using consistent hashing. The caveat is that sticky sessions can create hotspots as some sessions are more expensive to handle than others.
If it sounds like a L7 LB has some overlapping functionality with an API gateway, it’s because they both are HTTP proxies, and therefore their responsibilities can be blurred.
A L7 LB is typically used as the backend of a L4 LB to load balance requests sent by external clients from the internet (see Figure 14.3). Although L7 LBs offer more functionality than L4 LBs, they have a lower throughput in comparison, which makes L4 LBs better suited to protect against certain DDoS attacks, like SYN floods.
Figure 14.3: A L7 LB is typically used as the backend of a L4 one to load balance requests sent by external clients from the internet.
A drawback of using a dedicated load-balancing service is that all the traffic needs to go through it, and if the LB goes down, the service behind it is no longer reachable. Additionally, it's one more service that needs to be operated and scaled out.
When the clients are internal to an organization, the L7 LB functionality can alternatively be bolted onto the clients directly using the sidecar pattern. In this pattern, all network traffic from a client goes through a process co-located on the same machine. This process implements load balancing, rate-limiting, authentication, monitoring, and other goodies.
The sidecar processes form the data plane of a service mesh, which is configured by a corresponding control plane. This approach has been gaining popularity with the rise of microservices in organizations that have hundreds of services communicating with each other. Popular sidecar proxy load balancers as of this writing are NGINX, HAProxy, and Envoy. The advantage of using this approach is that it distributes the load-balancing functionality to the clients, removing the need for a dedicated service that needs to be scaled out and maintained. The con is a significant increase in the system’s complexity.
When we first discussed TCP in chapter 2, we talked about the importance of minimizing the latency between a client and a server. No matter how fast the server is, if the client is located on the other side of the world from it, the response time is going to be over 100 ms just because of the network latency, which is physically limited by the speed of light. Not to mention the increased error rate when sending data across the public internet over long distances.
To mitigate these performance issues, you can distribute the traffic to different data centers located in different regions. But how do you ensure that the clients communicate with the geographically closest L4 load balancer?
This is where DNS geo load balancing comes in — it’s an extension to DNS that considers the location of the client inferred from its IP, and returns a list of the geographically closest L4 LB VIPs (see Figure 14.4). The LB also needs to take into account the capacity of each data center and its health status.
Figure 14.4: Geo load balancing infers the location of the client from its IP
If the servers behind a load balancer are stateless, scaling out is as simple as adding more servers. But when there is state involved, some form of coordination is required.
Replication is the process of storing a copy of the same data in multiple nodes. If the data is static, replication is easy: just copy the data to multiple nodes, add a load balancer in front of it, and you are done. The challenge is dealing with dynamically changing data, which requires coordination to keep it in sync.
Replication and sharding are techniques that are often combined, but are orthogonal to each other. For example, a distributed data store can divide its data into N partitions and distribute them over K nodes. Then, a state-machine replication algorithm like Raft can be used to replicate each partition R times (see Figure 14.5).
Figure 14.5: A replicated and partitioned data store. A node can be the replication leader for a partition while being a follower for another one.
We have already discussed one way of replicating data in chapter 10. This section will take a broader, but less detailed, look at replication and explore different approaches with varying trade-offs. To keep things simple, we will assume that the dataset is small enough to fit on a single node, and therefore no partitioning is needed.
The most common approach to replicate data is the single leader, multiple followers/replicas approach (see Figure 14.6). In this approach, the clients send writes exclusively to the leader, which updates its local state and replicates the change to the followers. We have seen an implementation of this when we discussed the Raft replication algorithm.
Figure 14.6: Single leader replication
At a high level, the replication can happen either fully synchronously, fully asynchronously, or as a combination of the two.
Asynchronous replication
In this mode, when the leader receives a write request from a client, it asynchronously sends out requests to the followers to replicate it and replies to the client before the replication has been completed.
Although this approach is fast, it’s not fault-tolerant. What happens if the leader crashes right after accepting a write, but before replicating it to the followers? In this case, a new leader could be elected that doesn’t have the latest updates, leading to data loss, which is one of the worst possible trade-offs you can make.
The other issue is consistency. A successful write might not be visible to some or all replicas because the replication happens asynchronously. The client could send a write to the leader and later fail to read the data from a replica because it doesn't exist there yet. The only guarantee is that if the writes stop, eventually all replicas will catch up and become identical (eventual consistency).
Synchronous replication
Synchronous replication waits for a write to be replicated to all followers before returning a response to the client, which comes with a performance penalty. If a replica is extremely slow, every request is affected by it. In the extreme case, if any replica is down or unreachable, the store becomes unavailable and can no longer accept writes. And the more nodes the data store has, the more likely a fault becomes.
As you can see, fully synchronous or asynchronous replication are extremes that provide some advantages at the expense of others. Most data stores have replication strategies that use a combination of the two. For example, in Raft, the leader replicates its writes to a majority before returning a response to the client. And in PostgreSQL, you can configure a subset of replicas to receive updates synchronously rather than asynchronously.
In multi-leader replication, there is more than one node that can accept writes. This approach is used when the write throughput is too high for a single node to handle, or when a leader needs to be available in multiple data centers to be geographically closer to its clients.
Figure 14.7: Multi-leader replication
The replication happens asynchronously since the alternative would defeat the purpose of using multiple leaders in the first place. This form of replication is generally best avoided when possible, as it introduces a lot of complexity. The main issue with multiple leaders is conflicting writes: if the same data item is updated concurrently by two leaders, which one should win? To resolve conflicts, the data store needs to implement a conflict resolution strategy.
The simplest strategy is to design the system so that conflicts are not possible; this can be achieved under some circumstances if the data has a homing region. For example, if all the European customer requests are always routed to the European data center, which has a single leader, there won’t be any conflicting writes. There is still the possibility of a data center going down, but that can be mitigated with a backup data center in the same region, replicated with single-leader replication.
If assigning requests to specific leaders is not possible, and every client needs to be able to write to every leader, conflicting writes will inevitably happen.
One way to deal with a conflict updating a record is to store the concurrent writes and return them to the next client that reads the record. The client will try to resolve the conflict and update the data store with the resolution. In other words, the data store “pushes the can down the road” to the clients.
Alternatively, an automatic conflict resolution method, such as last-writer-wins, needs to be implemented.
What if any replica could accept writes from clients? In that case, there wouldn’t be any leader(s), and the responsibility of replicating and resolving conflicts would be offloaded entirely to the clients.
For this to work, a basic invariant needs to be satisfied. Suppose the data store has N replicas. When a client sends a write request to the replicas, it waits for at least W replicas to acknowledge it before moving on. And when it reads an entry, it does so by querying R replicas and taking the most recent one from the response set. Now, as long as W + R > N, the write set and the read set intersect, which guarantees that at least one record in the read set will reflect the latest write.
The writes are always sent to all N replicas in parallel; the W parameter determines just the number of responses the client has to receive to complete the request. The data store’s read and write throughput depend on how large or small R and W are. For example, a workload with many reads benefits from a smaller R, but in turn, that makes writes slower and less available.
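The invariant can be checked exhaustively for small configurations. The sketch below enumerates every possible write set and read set to show why W + R > N guarantees an overlap:

```python
from itertools import combinations

def quorum_intersects(n: int, w: int, r: int) -> bool:
    # True if every possible write set of size w shares at least one
    # replica with every possible read set of size r, out of n replicas.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

assert quorum_intersects(3, 2, 2)      # W + R > N: reads see the latest write
assert not quorum_intersects(3, 1, 2)  # W + R = N: a read can miss the write
```

For example, with N = 3, W = 2, and R = 2, any two replicas queried on a read must include at least one of the two replicas that acknowledged the write.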
Like in multi-leader replication, a conflict resolution strategy needs to be used when two or more writes to the same record happen concurrently.
Leaderless replication is even more complex than multi-leader replication, as it offloads the leader's responsibilities to the clients, and there are edge cases that affect consistency even when W + R > N is satisfied. For example, if a write succeeds on fewer than W replicas and fails on the others, the replicas are left in an inconsistent state.
Let’s take a look now at a very specific type of replication that only offers best effort guarantees: caching.
Suppose a service requires retrieving data from a remote dependency, like a data store, to handle its requests. As the service scales out, the dependency needs to do the same to keep up with the ever-increasing load. A cache can be introduced to reduce the load on the dependency and improve the performance of accessing the data.
A cache is a high-speed storage layer that temporarily buffers responses from downstream dependencies so that future requests can be served directly from it — it’s a form of best effort replication. For a cache to be cost-effective, there should be a high probability that requested data can be found in it. This requires the data access pattern to have a high locality of reference, like a high likelihood of accessing the same data again and again over time.
When a cache miss occurs, the missing data item has to be requested from the remote dependency, and the cache has to be updated with it. This can happen in two ways: either the client, after getting a miss, fetches the item from the dependency and updates the cache itself (a side cache), or the cache sits inline between the client and the dependency and fetches missing items transparently on the client's behalf.
Because a cache has a maximum capacity for holding entries, an entry needs to be evicted to make room for a new one when its capacity is reached. Which entry to remove depends on the eviction policy used by the cache and the client’s access pattern. One commonly used policy is to evict the least recently used (LRU) entry.
A cache also has an expiration policy that dictates for how long to store an entry. For example, a simple expiration policy defines the maximum time to live (TTL) in seconds. When a data item has been in the cache for longer than its TTL, it expires and can safely be evicted.
The expiration doesn’t need to occur immediately, though, and it can be deferred to the next time the entry is requested. In fact, that might be preferable — if the dependency is temporarily unavailable, and the cache is inline, it can opt to return an entry with an expired TTL to the client rather than an error.
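Putting the last two ideas together, here is a minimal sketch of a bounded cache with LRU eviction and lazy TTL expiration. The API is an assumption made for the example; the `now` parameter merely makes it deterministic, where a real cache would call `time.monotonic()` itself.

```python
import time
from collections import OrderedDict

class LRUCache:
    # A bounded cache with LRU eviction and lazy TTL expiration:
    # expired entries are only dropped when they are next requested.
    def __init__(self, capacity: int, ttl: float):
        self.capacity, self.ttl = capacity, ttl
        self._entries = OrderedDict()  # key -> (value, insertion time)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        if key not in self._entries:
            return None
        value, inserted = self._entries[key]
        if now - inserted > self.ttl:       # expired: evict lazily
            del self._entries[key]
            return None
        self._entries.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (value, now)
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the LRU entry

cache = LRUCache(capacity=2, ttl=10)
cache.put("a", 1, now=0)
cache.put("b", 2, now=0)
cache.get("a", now=1)     # "a" becomes the most recently used entry
cache.put("c", 3, now=1)  # over capacity: the LRU entry "b" is evicted
```

A production cache would also serve expired entries as a fallback when the dependency is unavailable, as described above, rather than always returning a miss.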
The simplest possible cache you can build is an in-memory dictionary located within the clients, such as a hash table of limited size, bounded by the available memory of the node.
Figure 14.8: In-process cache
Because each cache is completely independent of the others, consistency issues are inevitable since each client potentially sees a different version of the same entry. Additionally, an entry needs to be fetched once per cache, creating downstream pressure proportional to the number of clients.
This issue is exacerbated when a service with an in-process cache is restarted or scales out, as every newly started instance needs to fetch entries directly from the dependency. This can cause a "thundering herd" effect, where the downstream dependency is hit with a spike of requests. The same can happen at run-time if a data item that wasn't accessed before suddenly becomes very popular.
Request coalescing can be used to reduce the impact of a thundering herd. The idea is that there should be at most one outstanding request at a time to fetch a specific data item per in-process cache. For example, if a service instance is serving 10 concurrent requests that require a specific record not yet in the cache, the instance will send only a single request to the remote dependency to fetch the missing entry.
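A sketch of request coalescing using a per-key event is shown below. The `CoalescingCache` class and its API are assumptions made for the example; the point is only that concurrent requests for a missing entry wait on the single in-flight fetch instead of issuing their own.

```python
import threading
import time

class CoalescingCache:
    # At most one outstanding fetch per key: concurrent requests for a
    # missing entry block until the single in-flight fetch completes.
    def __init__(self, fetch):
        self._fetch = fetch      # fetches a value from the dependency
        self._cache = {}
        self._inflight = {}      # key -> Event for the pending fetch
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                if event is None:
                    # First requester: start the one outstanding fetch.
                    event = self._inflight[key] = threading.Event()
                    owner = True
                else:
                    owner = False
            if owner:
                value = self._fetch(key)
                with self._lock:
                    self._cache[key] = value
                    del self._inflight[key]
                    event.set()
                return value
            event.wait()  # another request is already fetching this key

calls = []
def slow_fetch(key):
    calls.append(key)
    time.sleep(0.05)  # simulate a slow remote dependency
    return key.upper()

cache = CoalescingCache(slow_fetch)
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("x")))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All ten concurrent requests get the same value, yet the dependency is contacted exactly once.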
An external cache, shared across all service instances, addresses some of the drawbacks of using an in-process cache at the expense of greater complexity and cost.
Because the external cache is shared among the service instances, there can be only a single version of each data item at any given time. And although the cached item can be out-of-date, every client accessing the cache will see the same version, which reduces consistency issues. The load on the dependency is also reduced since the number of times an entry is accessed no longer grows as the number of clients increases.
Figure 14.9: Out-of-process cache
Although we have managed to decouple the clients from the dependency, we have merely shifted the load to the external cache. If the load increases, the cache will eventually need to be scaled out. As little data as possible should be moved around when that happens to guarantee that the cache’s availability doesn’t drop and that the number of cache misses doesn’t significantly increase. Consistent hashing, or a similar partitioning technique, can be used to reduce the amount of data that needs to be moved around.
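To illustrate why consistent hashing limits the amount of data moved, here is a minimal hash ring (a simplified sketch; real deployments use many more virtual nodes per cache and a proper membership protocol):

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes on a hash ring with virtual nodes (a sketch)."""

    def __init__(self, nodes=(), replicas=100):
        self._replicas = replicas
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each node owns `replicas` points on the ring.
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}:{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # A key belongs to the first node point clockwise of its hash.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

When a node is removed, only the keys that were assigned to it move to other nodes; everything else stays put, which is exactly the property you want when scaling the cache out or in.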
Maintaining an external cache comes with a price as it’s yet another service that needs to be maintained and operated. Additionally, the latency to access it is higher than accessing an in-process cache because a network call is required.
If the external cache is down, how should the service react? You would think it might be okay to temporarily bypass the cache and directly hit the dependency. But in that case, the dependency might not be prepared to withstand a surge of traffic since it’s usually shielded by the cache. Consequently, the external cache becoming unavailable could easily cause a cascading failure, resulting in the dependency becoming unavailable as well.
The clients can leverage an in-process cache as a defense against the external cache becoming unavailable. Similarly, the dependency also needs to be prepared to handle these sudden “attacks.” Load shedding is a technique that can be used here, which we will discuss later in the book.
What’s important to understand is that a cache introduces a bi-modal behavior in the system4. Most of the time, the cache is working as expected, and everything is fine; when it’s not for whatever reason, the system needs to survive without it. It’s a design smell if your system can’t cope at all without a cache.
This is also referred to as layer 4 (L4) load balancing since layer 4 is the transport layer in the OSI model.↩︎
Also referred to as a layer 7 (L7) load balancer since layer 7 is the application layer in the OSI model.↩︎
A cache hit occurs when the requested data can be found in the cache, while a cache miss occurs when it cannot.↩︎
Remember when we talked about the bi-modal behavior of message channels in section 12.4? As we will learn later, you always want to minimize the number of modes in your applications to make them simple to understand and operate.↩︎
As you scale out your applications, any failure that can happen will eventually happen. Hardware failures, software crashes, memory leaks — you name it. The more components you have, the more failures you will experience.
Suppose you have a buggy service that leaks 1 MB of memory on average every hundred requests. If the service does a thousand requests per day, chances are you will restart the service to deploy a new build before the leak reaches any significant size. But if your service is doing 10 million requests per day, then by the end of the day you lose 100 GB of memory! Eventually, the servers won’t have enough memory available and will start to thrash due to the constant swapping of pages in and out of disk.
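Restated as code, just to make the arithmetic explicit (a trivial helper, not from the book):

```python
def leaked_mb_per_day(requests_per_day, mb_per_100_requests=1):
    """Daily memory leaked by a service losing `mb_per_100_requests` MB
    for every hundred requests it serves."""
    return requests_per_day * mb_per_100_requests / 100


# Low-traffic service: the leak stays negligible between deployments.
print(leaked_mb_per_day(1_000))       # 10 MB per day
# High-traffic service: the same bug is fatal within a day.
print(leaked_mb_per_day(10_000_000))  # 100,000 MB, i.e., 100 GB per day
```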
This nasty behavior is caused by cruel math; given an operation that has a certain probability of failing, the total number of failures increases with the total number of operations performed. In other words, the more you scale out your system to handle more load, and the more operations and moving parts there are, the more failures your systems will experience.
Remember when we talked about availability and “nines” in chapter 1? Well, to guarantee just two nines, your system can be unavailable for up to 15 min a day. That’s very little time to take any manual action. If you strive for 3 nines, then you only have 43 minutes per month available. Although you can’t escape cruel math, you can mitigate it by implementing self-healing mechanisms to reduce the impact of failures.
Chapter 15 describes the causes of the most common failures in distributed systems: single points of failure, unreliable networks, slow processes, and unexpected load.
Chapter 16 dives into resiliency patterns that help shield a service against failures in downstream dependencies, like timeouts, retries, and circuit breakers.
Chapter 17 discusses resiliency patterns that help protect a service against upstream pressure, like load shedding, load leveling, and rate-limiting.
In order to protect your systems against failures, you first need to have an idea of what can go wrong. The most common failures you will encounter are caused by single points of failure, the network being unreliable, slow processes, and unexpected load. Let’s take a closer look at these.
A single point of failure is the most glaring cause of failure in a distributed system; if it were to fail, that one component would bring down the entire system with it. In practice, systems can have multiple single points of failure.
A service that starts up by reading its configuration from a non-replicated database is an example of a single point of failure; if the database isn’t reachable, the service won’t be able to (re)start. A more subtle example is a service that exposes an HTTP API on top of TLS using a certificate that needs to be manually renewed. If the certificate isn’t renewed by the time it expires, then all clients trying to connect won’t be able to open a connection with the service.
Single points of failure should be identified when the system is architected before they can cause any harm. The best way to detect them is to examine every component of the system and ask what would happen if that component were to fail. Some single points of failure can be architected away, e.g., by introducing redundancy, while others can’t. In that case, the only option left is to minimize the blast radius.
When a client makes a remote network call, it sends a request to a server and expects to receive a response from it a while later. In the best case, the client receives a response shortly after sending the request. But what if the client waits and waits and still doesn’t get a response? In that case, the client doesn’t know whether a response will eventually arrive or not. At that point it has only two options: it can either continue to wait, or fail the request with an exception or error.
As discussed when the concept of failure detection was introduced in chapter 7, there are several reasons why the client hasn’t received a response so far:
Slow network calls are the silent killers of distributed systems. Because the client doesn’t know whether the response is on its way or not, it can spend a long time waiting before giving up, if it gives up at all. The wait can in turn cause degradations that are extremely hard to debug. In chapter 16 we will explore ways to protect clients from the unreliability of the network.
From an observer’s point of view, a very slow process is not very different from one that isn’t running at all — neither can perform useful work. Resource leaks are one of the most common causes of slow processes. Whenever you use resources, especially when they have been leased from a pool, there is a potential for leaks.
Memory is the most well-known source of leaks. A memory leak causes a steady increase in memory consumption over time. Run-times with garbage collection don’t help much either; if a reference to an object that is no longer needed is kept somewhere, the object won’t be deleted by the garbage collector.
A memory leak keeps consuming memory until there is no more of it, at which point the operating system starts swapping memory pages to disk constantly, while the garbage collector kicks in more frequently trying its best to release any shred of memory. The constant paging and the garbage collector eating up CPU cycles make the process slower. Eventually, when there is no more physical memory, and there is no more space in the swap file, the process won’t be able to allocate more memory, and most operations will fail.
Memory is just one of the many resources that can leak. For example, if you are using a thread pool, you can lose a thread when it blocks on a synchronous call that never returns. If a thread makes a synchronous blocking HTTP call without setting a timeout, and the call never returns, the thread won’t be returned to the pool. Since the pool has a fixed size and keeps losing threads, it will eventually run out of threads.
You might think that making asynchronous calls, rather than synchronous ones, would help in the previous case. However, modern HTTP clients use socket pools to avoid recreating TCP connections and paying a hefty performance fee, as discussed in chapter 2. If a request is made without a timeout, the connection is never returned to the pool. As the pool has a limited size, eventually there won’t be any connections left.
On top of that, the code you write isn’t the only one accessing memory, threads, and sockets. The libraries your application depends on access the same resources, and they can do all kinds of shady things. Without digging into their implementation, assuming it’s open in the first place, you can’t be sure whether they can wreak havoc or not.
Every system has a limit to how much load it can withstand without scaling. Depending on how the load increases, you are bound to hit that brick wall sooner or later. An organic increase in load gives you time to scale out your service accordingly; a sudden and unexpected spike is another matter entirely.
For example, consider the number of requests received by a service in a period of time. The rate and the type of incoming requests can change over time, and sometimes suddenly, for a variety of reasons:
To withstand unexpected load, you need to prepare beforehand. The patterns in chapter 17 will teach you some techniques on how to do just that1.
You would think that if your system has hundreds of processes, it shouldn’t make much difference if a small percentage are slow or unreachable. The thing about failures is that they tend to spread like cancer, propagating from one process to another until the whole system is brought to its knees. This effect is also referred to as a cascading failure, which occurs when a portion of an overall system fails, increasing the probability that other portions fail.
For example, suppose there are multiple clients querying two database replicas A and B, which are behind a load balancer. Each replica is handling about 50 transactions per second (see Figure 15.1).
Figure 15.1: Two replicas behind an LB; each is handling half the load.
Suddenly, replica B becomes unavailable because of a network fault. The load balancer detects that B is unavailable and removes it from its pool. Because of that, replica A has to pick up the slack for replica B, doubling the load it was previously under (see Figure 15.2).
Figure 15.2: When replica B becomes unavailable, A will be hit with more load, which can strain it beyond its capacity.
As replica A starts to struggle to keep up with the incoming requests, the clients experience more failures and timeouts. In turn, they retry the same failing requests several times, adding insult to injury.
Eventually, replica A is under so much load that it can no longer serve requests promptly and becomes unavailable, causing replica A to be removed from the load balancer’s pool. In the meantime, replica B becomes available again and the load balancer puts it back in the pool, at which point it’s flooded with requests that kill the replica instantaneously. This feedback loop of doom can repeat several times.
Cascading failures are very hard to get under control once they have started. The best way to mitigate one is to not have it in the first place. The patterns introduced in the next chapters will help you stop the cracks in the system from spreading.
As we have just seen, a distributed system needs to embrace that failures will happen and be prepared for them. Just because a failure has a chance of happening doesn’t necessarily mean you have to do something about it. The day has only so many hours, and you will need to make tough decisions about where to spend your engineering time.
Given a specific failure, you have to consider its probability of happening and the impact it causes to your system if it does happen. By multiplying the two factors together, you get a risk score, which you can then use to decide which failures to prioritize and act upon (see Figure 15.3). A failure that is very likely to happen, and has an extensive impact, should be dealt with swiftly; on the other hand, a failure with a low likelihood and low impact can wait.
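A back-of-the-envelope version of this prioritization might look as follows (the failure list and the 1-to-3 scales are illustrative assumptions, not from the book):

```python
def risk_score(probability, impact):
    """Risk score as probability times impact, each on a 1 (low) to 3 (high) scale."""
    return probability * impact


# Hypothetical failures with (probability, impact) estimates.
failures = {
    "certificate expires unnoticed": (3, 3),   # likely, takes the service down
    "single availability-zone outage": (2, 3),  # less likely, broad impact
    "disk full on one replica": (2, 1),         # likely enough, narrow impact
}

# Deal with the highest-scoring failures first.
ranked = sorted(failures, key=lambda f: risk_score(*failures[f]), reverse=True)
```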
Figure 15.3: Risk matrix
To address a failure, you can either find a way to reduce the probability of it happening, or reduce its impact.
These techniques might look simple but are very effective. During the COVID-19 outbreak, I have witnessed many of the systems I was responsible for at the time doubling traffic nearly overnight without causing any incidents.↩︎
In this chapter, we will explore patterns that shield a service against failures in its downstream dependencies.
When you make a network call, you can configure a timeout to fail the request if there is no response within a certain amount of time. If you make the call without setting a timeout, you tell your code that you are 100% confident that the call will succeed. Would you really take that bet?
Unfortunately, some network APIs don’t have a way to set a timeout in the first place. When the default timeout is infinity, it’s all too easy for a client to shoot itself in the foot. As mentioned earlier, network calls that don’t return lead to resource leaks at best. Timeouts limit and isolate failures, stopping them from cascading to the rest of the system. And they are useful not just for network calls, but also for requesting a resource from a pool and for synchronization primitives like mutexes.
To drive the point home on the importance of setting timeouts, let’s take a look at some concrete examples. JavaScript’s XMLHttpRequest is the web API to retrieve data from a server asynchronously. Its default timeout is zero, which means there is no timeout:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/api', true);
// No timeout by default!
xhr.timeout = 10000;
xhr.onload = function () {
  // Request finished
};
xhr.ontimeout = function (e) {
  // Request timed out
};
xhr.send(null);
Client-side timeouts are as crucial as server-side ones. There is a maximum number of sockets your browser can open for a particular host. If you make network requests that never return, you are going to exhaust the socket pool. When the pool is exhausted, you are no longer able to connect to the host.
The fetch web API is a modern replacement for XMLHttpRequest that uses Promises. When the fetch API was initially introduced, there was no way to set a timeout at all. Browsers have recently added experimental support for the Abort API to support timeouts.
const controller = new AbortController();
const signal = controller.signal;
const fetchPromise = fetch(url, {signal});
// No timeout by default!
setTimeout(() => controller.abort(), 10000);
fetchPromise.then(response => {
  // Request finished
})
Things aren’t much rosier for Python. The popular requests library uses a default timeout of infinity:
import requests

# No timeout by default!
response = requests.get('https://github.com/', timeout=10)
Go’s HTTP package doesn’t use timeouts by default, either:
var client = &http.Client{
  // No timeout by default!
  Timeout: time.Second * 10,
}
response, _ := client.Get(url)
Modern HTTP clients for Java and .NET do a much better job and usually come with default timeouts. For example, .NET Core HttpClient has a default timeout of 100 seconds. It’s lax but better than not setting a timeout at all.
As a rule of thumb, always set timeouts when making network calls, and be wary of third-party libraries that do network calls or use internal resource pools but don’t expose settings for timeouts. And if you build libraries, always set reasonable default timeouts and make them configurable for your clients.
Ideally, you should set your timeouts based on the desired false timeout rate. Say you want to have about 0.1% false timeouts; to achieve that, you should set the timeout to the 99.9th percentile of the remote call’s response time, which you can measure empirically.
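A rough way to derive such a timeout from measured response times is the nearest-rank percentile (a sketch; a production system would compute this continuously from live latency histograms rather than a static sample):

```python
def timeout_for(latencies_ms, false_timeout_rate=0.001):
    """Pick a timeout as the (1 - rate) percentile of measured response times.

    With the default rate of 0.001, the timeout sits at the 99.9th
    percentile, so roughly 0.1% of healthy calls are expected to time out.
    """
    ordered = sorted(latencies_ms)
    # Nearest-rank index of the requested percentile.
    rank = min(len(ordered) - 1, int(len(ordered) * (1 - false_timeout_rate)))
    return ordered[rank]
```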
You also want to have good monitoring in place to measure the entire lifecycle of your network calls, like the duration of the call, the status code received, and if a timeout was triggered. We will talk about monitoring later in the book, but the point I want to make here is that you have to measure what happens at the integration points of your systems, or you won’t be able to debug production issues when they show up.
Ideally, you want to encapsulate a remote call within a library that sets timeouts and monitors it for you so that you don’t have to remember to do this every time you make a network call. No matter which language you use, there is likely a library out there that implements some of the resiliency and transient fault-handling patterns introduced in this chapter, which you can use to encapsulate your system’s network calls.
Using a language-specific library is not the only way to wrap your network calls; you can also leverage a reverse proxy co-located on the same machine which intercepts all the remote calls that your process makes1. The proxy enforces timeouts and also monitors the calls, relieving your process of the responsibility to do so.
You know by now that a client should configure a timeout when making a network request. But, what should it do when the request fails, or the timeout fires? The client has two options at that point: it can either fail fast or retry the request at a later time.
If the failure or timeout was caused by a short-lived connectivity issue, then retrying after some backoff time has a high probability of succeeding. However, if the downstream service is overwhelmed, retrying immediately will only make matters worse. This is why retrying needs to be slowed down with increasingly longer delays between the individual retries until either a maximum number of retries is reached or a certain amount of time has passed since the initial request.
To set the delay between retries, you can use a capped exponential function, where the delay is derived by multiplying the initial backoff duration by a constant after each attempt, up to some maximum value (the cap):
For example, if the cap is set to 8 seconds, and the initial backoff duration is 2 seconds, then the first retry delay is 2 seconds, the second is 4 seconds, the third is 8 seconds, and any further delay will be capped to 8 seconds.
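In code, the capped exponential delay is simply the initial backoff doubled after each attempt and clamped to the cap; the sketch below reproduces the example’s numbers:

```python
def backoff_delay(attempt, initial_backoff=2.0, cap=8.0):
    """Capped exponential backoff: initial_backoff * 2^attempt, clamped to cap.

    attempt is zero-based, so attempt=0 is the first retry.
    """
    return min(cap, initial_backoff * 2 ** attempt)


# First retry 2 s, then 4 s, then 8 s, then capped at 8 s.
print([backoff_delay(a) for a in range(5)])  # [2.0, 4.0, 8.0, 8.0, 8.0]
```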
Although exponential backoff does reduce the pressure on the downstream dependency, there is still a problem. When the downstream service is temporarily degraded, it’s likely that multiple clients see their requests failing around the same time. This causes the clients to retry simultaneously, hitting the downstream service with load spikes that can further degrade it, as shown in Figure 16.1.
Figure 16.1: Retry storm
To avoid this herding behavior, you can introduce random jitter in the delay calculation. With it, the retries spread out over time, smoothing out the load to the downstream service:
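One common variant is “full jitter”, where the actual delay is drawn uniformly at random between zero and the capped exponential bound (a sketch; other jitter strategies keep the delay closer to the bound):

```python
import random


def jittered_delay(attempt, initial_backoff=2.0, cap=8.0):
    """Capped exponential backoff with full jitter.

    Drawing the delay uniformly from [0, bound] spreads retries from
    different clients over time instead of synchronizing them.
    """
    bound = min(cap, initial_backoff * 2 ** attempt)
    return random.uniform(0, bound)
```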
Actively waiting and retrying failed network requests isn’t the only way to implement retries. In batch applications that don’t have strict real-time requirements, a process can park failed requests into a retry queue. The same process, or possibly another, reads from the same queue later and retries the requests.
Just because a network call can be retried doesn’t mean it should be. If the error is not short-lived, for example, because the process is not authorized to access the remote endpoint, then it makes no sense to retry the request since it will fail again. In this case, the process should fail fast and cancel the call right away.
You should also not retry a network call that isn’t idempotent, and whose side effects can affect your application’s correctness. Suppose a process is making a call to a payment provider service, and the call times out; should it retry or not? The operation might have succeeded and retrying would charge the account twice, unless the request is idempotent.
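The decision logic of the last two paragraphs can be sketched as a small retry wrapper (the error taxonomy here is illustrative; real clients classify errors by status code or exception type, and add backoff with jitter between attempts):

```python
class TransientError(Exception):
    """E.g., a timeout or a connection reset: worth retrying."""


class PermanentError(Exception):
    """E.g., an authorization failure: the next attempt would fail too."""


def call_with_retries(operation, is_idempotent, max_attempts=3):
    """Retry only operations that are both retryable and safe to retry."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except PermanentError:
            raise  # fail fast: retrying can't help
        except TransientError:
            if not is_idempotent:
                raise  # retrying might apply the side effect twice
            if attempts >= max_attempts:
                raise
```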
Suppose that handling a request from a client requires it to go through a chain of dependencies. The client makes a call to service A, which, to handle the request, talks to service B, which in turn talks to service C.
If the intermediate request from service B to service C fails, should B retry the request or not? Well, if B does retry it, A will perceive a longer execution time for its request, which in turn makes it more likely to hit A’s timeout. If that happens, A retries its request again, making it more likely for the client to hit its timeout and retry.
Having retries at multiple levels of the dependency chain can amplify the number of retries; the deeper a service is in the chain, the higher the load it will be exposed to due to the amplification (see Figure 16.2).
Figure 16.2: Retry amplification in action
And if the pressure gets bad enough, this behavior can easily bring down the whole system. That’s why when you have long dependency chains, you should only retry at a single level of the chain, and fail fast in all the other ones.
Suppose your service uses timeouts to detect communication failures with a downstream dependency, and retries to mitigate transient failures. If the failures aren’t transient and the downstream dependency keeps being unresponsive, what should it do then? If the service keeps retrying failed requests, it will necessarily become slower for its clients. In turn, this slowness can propagate to the rest of the system and cause cascading failures.
To deal with non-transient failures, we need a mechanism that detects long-term degradations of downstream dependencies and stops new requests from being sent downstream in the first place. After all, the fastest network call is the one you don’t have to make. This mechanism is also called a circuit breaker, inspired by the same functionality implemented in electrical circuits.
A circuit breaker’s goal is to allow a sub-system to fail without bringing down the whole system with it. To protect the system, calls to the failing sub-system are temporarily blocked. Later, when the sub-system recovers and failures stop, the circuit breaker allows calls to go through again.
Unlike retries, circuit breakers prevent network calls entirely, which makes the pattern particularly useful for long-term degradations. In other words, retries are helpful when the expectation is that the next call will succeed, while circuit breakers are helpful when the expectation is that the next call will fail.
The circuit breaker is implemented as a state machine that can be in one of three states: open, closed and half-open (see Figure 16.3).
Figure 16.3: Circuit breaker state machine
In the closed state, the circuit breaker is merely acting as a pass-through for network calls. In this state, the circuit breaker tracks the number of failures, like errors and timeouts. If the number goes over a certain threshold within a predefined time-interval, the circuit breaker trips and opens the circuit.
When the circuit is open, network calls aren’t attempted and fail immediately. As an open circuit breaker can have business implications, you need to think carefully about what should happen when a downstream dependency is down. If the downstream dependency is non-critical, you want your service to degrade gracefully, rather than stop entirely.
Think of an airplane that loses one of its non-critical sub-systems in flight; it shouldn’t crash, but rather gracefully degrade to a state where the plane can still fly and land. Another example is Amazon’s front page; if the recommendation service is not available, the page should render without recommendations. It’s a better outcome than to fail the rendering of the whole page entirely.
After some time has passed, the circuit breaker decides to give the downstream dependency another chance and transitions to the half-open state. In the half-open state, the next call is allowed to pass through to the downstream service. If the call succeeds, the circuit breaker transitions to the closed state; if the call fails instead, it transitions back to the open state.
That’s really all there is to understanding how a circuit breaker works, but the devil is in the details. How many failures are enough to consider a downstream dependency down? How long should the circuit breaker wait before transitioning from the open to the half-open state? It really depends on your specific case; only by using data about past failures can you make an informed decision.
So far, we have discussed patterns that protect against downstream failures, like failures to reach an external dependency. In this chapter, we will shift gears and discuss mechanisms to protect against upstream pressure.
A server has very little control over how many requests it receives at any given time, which can deeply impact its performance.
The operating system has a connection queue per port with a limited capacity that, when reached, causes new connection attempts to be rejected immediately. But typically, under extreme load, the server slows to a crawl well before that limit is reached, as it starves out of resources like memory, threads, sockets, or files. This causes the response time to increase to the point that the server becomes unavailable to the outside world.
When a server operates at capacity, there is no good reason for it to keep accepting new requests since that will only end up degrading it. In that case, the process should start rejecting excess requests so that it can focus on the ones it is already processing.
The definition of overload depends on your system, but the general idea is that it should be measurable and actionable. For example, the number of concurrent requests being processed is a good candidate to measure a server’s load; all you have to do is increment a counter when a new request comes in and decrement it when the server has processed the request and sent back a response to the client.
When the server detects that it’s overloaded, it can reject incoming requests by failing fast and returning a 503 (Service Unavailable) status code in the response. This technique is also referred to as load shedding. The server doesn’t necessarily have to reject arbitrary requests though; for example, if different requests have different priorities, the server could reject only the lower-priority ones.
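A load shedder based on the concurrent-request counter described above might look like the following sketch. The class and the `max_concurrent` threshold are hypothetical; a real server would call `try_acquire` when a request arrives (replying 503 on `False`) and `release` once the response has been sent.

```python
import threading


class LoadShedder:
    """Shed excess load by capping the number of in-flight requests."""

    def __init__(self, max_concurrent=100):
        self.max_concurrent = max_concurrent  # illustrative capacity threshold
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        # Called when a new request comes in; False means reply with 503.
        with self.lock:
            if self.in_flight >= self.max_concurrent:
                return False
            self.in_flight += 1
            return True

    def release(self):
        # Called after the response has been sent back to the client.
        with self.lock:
            self.in_flight -= 1
```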
Unfortunately, rejecting a request doesn’t completely spare the server the cost of handling it. Depending on how the rejection is implemented, the server might still have to pay the price of opening a TLS connection and reading the request just to finally reject it. Hence, load shedding can only help so much; if the load keeps increasing, eventually, the cost of rejecting requests takes over and the service starts to degrade.
Load leveling is an alternative to load shedding, which can be used when clients don’t expect a response within a short time frame.
The idea is to introduce a messaging channel between the clients and the service. The channel decouples the load directed at the service from its capacity, allowing the service to process requests at its own pace: rather than requests being pushed to the service by the clients, they are pulled by the service from the channel. This pattern is referred to as load leveling, and it’s well suited to fend off short-lived spikes, which the channel smooths out (see Figure 17.1).
Figure 17.1: The channel smooths out the load for the consuming service.
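The pattern can be sketched with an in-process queue standing in for the messaging channel; a real system would use a message broker, and the function names here are illustrative:

```python
import queue
import threading

# The channel decouples producers from the consuming service.
channel = queue.Queue()


def client_send(request):
    # The client enqueues and returns immediately; it doesn't expect
    # a response within a short time frame.
    channel.put(request)


def service_worker(handle, stop):
    # The service pulls requests at whatever rate it can sustain,
    # so a spike of puts is smoothed out over time.
    while not stop.is_set():
        try:
            request = channel.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(request)
        channel.task_done()
```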
Load shedding and load leveling don’t address an increase in load directly, but rather protect a service from getting overloaded. To handle more load, the service needs to be scaled out. This is why these protection mechanisms are typically combined with auto-scaling, which detects that the service is running hot and automatically increases its scale to handle the additional load.
Rate-limiting, or throttling, is a mechanism that rejects a request when a specific quota is exceeded. A service can have multiple quotas, like for the number of requests seen, or the number of bytes received within a time interval. Quotas are typically applied to specific users, API keys, or IP addresses.
For example, if a service with a quota of 10 requests per second per API key receives, on average, 12 requests per second from a specific API key, it will, on average, reject 2 requests per second tagged with that API key.
When a service rate-limits a request, it needs to return a response with a particular error code so that the sender knows that it failed because a quota has been breached. For services with HTTP APIs, the most common way to do that is by returning a response with status code 429 (Too Many Requests). The response should include additional details about which quota has been breached and by how much; it can also include a Retry-After header indicating how long to wait before making a new request:
HTTP/1.1 429 Too Many Requests
Retry-After: 60
If the client application plays by the rules, it stops hammering the service for some time, protecting it from non-malicious users monopolizing it by mistake. This protects against bugs in the clients that, for one reason or another, cause a client to repeatedly hit a downstream service for no good reason.
Rate-limiting is also used to enforce pricing tiers; if a user wants to use more resources, they also need to be prepared to pay more. This is how you can offload your service’s cost to your users: have them pay proportionally to their usage and enforce pricing tiers with quotas.
You would think that rate-limiting also offers strong protection against a distributed denial-of-service (DDoS) attack, but it only partially protects a service from it. Nothing forbids throttled clients from continuing to hammer a service after getting 429s. And no, rate-limited requests aren’t free either: for example, to rate-limit a request by API key, the service has to pay the price of opening a TLS connection and, at the very least, of downloading part of the request to read the key. Although rate-limiting doesn’t fully protect against DDoS attacks, it does help reduce their impact.
Economies of scale are the only true protection against DDoS attacks. If you run multiple services behind one large frontend service, no matter which of the services behind it are attacked, the frontend service will be able to withstand the attack by rejecting the traffic upstream. The beauty of this approach is that the cost of running the frontend service is amortized across all the services that are using it.
Although rate-limiting has some similarities to load shedding, they are different concepts. Load shedding rejects traffic based on the local state of a process, like the number of requests concurrently processed by it; rate-limiting instead sheds traffic based on the global state of the system, like the total number of requests concurrently processed for a specific API key across all service instances.
The implementation of rate-limiting is interesting in its own right, and it’s well worth spending some time studying it, as a similar approach can be applied to other use cases. We will start with a single-process implementation first and then proceed to a distributed one.
Suppose we want to enforce a quota of 2 requests per minute, per API key. A naive approach would be to use a doubly-linked list per API key, where each list stores the timestamps of the last N requests received. Every time a new request comes in, an entry is appended to the list with its corresponding timestamp. Then periodically, entries older than a minute are purged from the list.
By keeping track of each list’s length, the process can rate-limit incoming requests by comparing that length with the quota. The problem with this approach is that it requires a list per API key, which quickly becomes expensive in terms of memory as it grows with the number of requests received.
To reduce memory consumption, we need to come up with a way to compress the storage required. One way to do this is to divide time into buckets of a fixed duration, for example 1 minute, and keep track of how many requests have been seen within each bucket (see Figure 17.2).
Figure 17.2: Buckets divide time into 1-minute intervals, which keep track of the number of requests seen.
A bucket contains a numerical counter. When a new request comes in, its timestamp is used to determine the bucket it belongs to. For example, if a request arrives at 12.00.18, the counter of the bucket for minute “12.00” is incremented by 1 (see Figure 17.3).
Figure 17.3: When a new request comes in, its timestamp is used to determine the bucket it belongs to.
With bucketing, we can compress the information about the number of requests seen in a way that doesn’t grow as the number of requests does. Now that we have a memory-friendly representation, how can we use it to implement rate-limiting? The idea is to use a sliding window that moves in real-time across the buckets, keeping track of the number of requests within it.
The sliding window represents the interval of time used to decide whether to rate-limit or not. The window’s length depends on the time unit used to define the quota, which in our case is 1 minute. But there is a caveat: a sliding window can overlap with multiple buckets. To derive the number of requests under the sliding window, we have to compute a weighted sum of the buckets’ counters, where each bucket’s weight is proportional to its overlap with the sliding window (see Figure 17.4).
Figure 17.4: A bucket’s weight is proportional to its overlap with the sliding window.
Although this is an approximation, it’s a reasonably good one for our purposes. And it can be made more accurate by increasing the granularity of the buckets; for example, you can reduce the approximation error by using 30-second buckets rather than 1-minute ones.
We only have to store as many buckets as the sliding window can overlap with at any given time. For example, with a 1-minute window and a 1-minute bucket length, the sliding window can touch at most 2 buckets. And if it can touch at most two buckets, there is no point in storing the third-oldest bucket, the fourth-oldest one, and so on.
To summarize, this approach requires two counters per API key, which is much more efficient in terms of memory than the naive implementation storing a list of requests per API key.
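Putting the pieces together, a single-process sliding-window rate limiter with two buckets per API key might be sketched as follows. The class name and API are hypothetical, and timestamps are plain Unix seconds:

```python
class SlidingWindowLimiter:
    """Sliding-window rate limiter keeping at most two 1-minute buckets per key."""

    BUCKET_SECONDS = 60

    def __init__(self, quota_per_minute):
        self.quota = quota_per_minute
        self.buckets = {}  # api_key -> {bucket_start: counter}

    def allow(self, api_key, now):
        bucket = now - (now % self.BUCKET_SECONDS)  # current bucket's start time
        prev_bucket = bucket - self.BUCKET_SECONDS
        counters = self.buckets.setdefault(api_key, {})
        # Drop buckets the sliding window can no longer overlap with.
        for start in list(counters):
            if start not in (bucket, prev_bucket):
                del counters[start]
        # The previous bucket's weight is its overlap with the sliding window.
        overlap = 1.0 - (now - bucket) / self.BUCKET_SECONDS
        estimate = counters.get(prev_bucket, 0) * overlap + counters.get(bucket, 0)
        if estimate >= self.quota:
            return False  # over quota: reject with 429
        counters[bucket] = counters.get(bucket, 0) + 1
        return True
```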
When more than one process accepts requests, the local state no longer cuts it, as the quota needs to be enforced on the total number of requests per API key across all service instances. This requires a shared data store to keep track of the number of requests seen.
As discussed earlier, we need to store two integers per API key, one for each bucket. When a new request comes in, the process receiving it could fetch the bucket, update it, and write it back to the data store. But that wouldn’t work, because two processes could update the same bucket concurrently, which would result in a lost update. To avoid any race conditions, the fetch, update, and write operations need to be packaged into a single transaction.
Although this approach is functionally correct, it’s costly. There are two issues here: transactions are slow, and executing one per request would be crazy expensive as the database would have to scale linearly with the number of requests. On top of that, for each request a process receives, it needs to do an outgoing call to a remote data store. What should it do if it fails?
Let’s address these issues. Rather than using transactions, we can use a single atomic get-and-increment operation that most data stores provide. Alternatively, the same can be emulated with a compare-and-swap. Atomic operations have much better performance than transactions.
Now, rather than updating the database on each request, the process can batch bucket updates in memory for some time, and flush them asynchronously to the database at the end of it (see Figure 17.5). This reduces the shared state’s accuracy, but it’s a good trade-off as it reduces the load on the database and the number of requests sent to it.
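The batching idea can be sketched as follows; `store_increment` stands in for the data store’s atomic get-and-increment, and everything else is illustrative:

```python
import threading
from collections import Counter


class BatchedCounterFlusher:
    """Batch bucket increments in memory; flush them with one atomic
    increment per key instead of one data-store call per request."""

    def __init__(self, store_increment):
        self.store_increment = store_increment  # (key, delta) -> None, assumed atomic
        self.pending = Counter()
        self.lock = threading.Lock()

    def record(self, key):
        # Called on the hot path for every request; purely in-memory.
        with self.lock:
            self.pending[key] += 1

    def flush(self):
        # Called periodically from a background task.
        with self.lock:
            batch, self.pending = self.pending, Counter()
        for key, delta in batch.items():
            self.store_increment(key, delta)
```

The trade-off is exactly the one described above: between flushes, other instances don’t see this process’s counts, so the shared state is slightly stale.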
Figure 17.5: Servers batch bucket updates in memory for some time, and flush them asynchronously to the database at the end of it.
What happens if the database is down? Remember the CAP theorem’s essence: when there is a network fault, we can either sacrifice consistency and keep our system up, or maintain consistency and stop serving requests. In our case, temporarily rejecting all incoming requests just because the database used for rate-limiting is not reachable could be very damaging to the business. Instead, it’s safer to keep serving requests based on the last state read from the store.
The goal of the bulkhead pattern is to isolate a fault in one part of a service from taking the entire service down with it. The pattern is named after the partitions of a ship’s hull. If one partition is damaged and fills up with water, the leak is isolated to that partition and doesn’t spread to the rest of the ship.
Some clients can create much more load on a service than others. Without any protections, a single greedy client can hammer the system and degrade every other client. We have seen some patterns, like rate-limiting, that help prevent a single client from using more resources than it should. But rate-limiting is not bulletproof. You can rate-limit clients based on the number of requests per second; but what if a client sends very heavy or poisonous requests that cause the servers to degrade? In that case, rate-limiting wouldn’t help much, as the issue is intrinsic to the requests sent by that client, which could eventually degrade the service for every other client.
When everything else fails, the bulkhead pattern provides guaranteed fault isolation by design. The idea is to partition a shared resource, like a pool of service instances behind a load balancer, and assign each user of the service to a specific partition so that its requests can only utilize resources belonging to the partition it’s assigned to.
Consequently, a heavy or poisonous user can only degrade the requests of users within the same partition. For example, suppose there are 10 instances of a service behind a load balancer, which are divided into 5 partitions (see Figure 17.6). In that case, a problematic user can only ever impact 20 percent of the service’s instances. The problem is that the unlucky users who happen to be on the same partition as the problematic one are fully impacted. Can we do better?
Figure 17.6: Service instances partitioned into 5 partitions
We can introduce virtual partitions that are composed of a random subset of instances. This makes it much less likely for two users to be allocated to the exact same virtual partition.
In our example, we can extract 45 combinations of 2 instances (virtual partitions) from a pool of 10 instances. When a virtual partition is degraded, other virtual partitions are only partially impacted as they don’t fully overlap (see Figure 17.7). If you combine this with a health check on the load balancer, and a retry mechanism on the client side, what you get is much better fault isolation.
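One way to assign users to virtual partitions is to hash the user’s identity into one of the C(10, 2) = 45 possible 2-instance subsets. This is an illustrative sketch, not a definitive scheme; a production system would also need to keep load balanced across partitions:

```python
import hashlib
import itertools


def virtual_partition(user_id, instances, partition_size=2):
    """Deterministically map a user to a virtual partition: a subset of
    instances chosen by hashing the user id. With 10 instances and subsets
    of size 2, there are C(10, 2) = 45 possible partitions."""
    subsets = list(itertools.combinations(sorted(instances), partition_size))
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return subsets[digest % len(subsets)]
```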
Figure 17.7: Virtual partitions are far less likely to fully overlap with each other.
You need to be careful when applying the bulkhead pattern; if you take it too far and create too many partitions, you lose all the economy-of-scale benefits of sharing costly resources across a set of users that are active at different times.
You also introduce a scaling problem. Scaling is simple when there are no partitions and every user can be served by any instance, as you can just add more instances. It’s not that easy with a partitioned pool of instances as some partitions are much hotter than others.
So far, we have explored patterns that allow a process to shed or reject incoming requests. Those are mitigations a server can apply only after it has received a request. Wouldn’t it be nice to have a way to control the incoming traffic so that it doesn’t reach a degraded server in the first place?
If the server is behind a load balancer and can communicate that it’s overloaded, the balancer can stop sending requests to it. The process can expose a health endpoint that, when queried, performs a health check and returns 200 (OK) if the process can serve requests, or an error code if it’s overloaded and has no more capacity to serve them.
The health endpoint is periodically queried by the load balancer. If the endpoint returns an error, the load balancer considers the process unhealthy and takes it out of the pool. Similarly, if the request to the health endpoint times out, the process is also taken out of the pool.
Health checks are critical to achieving high availability; if you have a service with 10 servers and one is unresponsive for some reason, then 10% of the requests will fail, which will cause the service’s availability to drop to 90%.
Let’s have a look at the different types of health checks that you can leverage in your service.
A liveness health test is the most basic form of checking the health of a process. The load balancer simply performs a basic HTTP request to see whether the process replies with a 200 (OK) status code.
A local health test checks whether the process is degraded or in some faulty state. The process’s performance typically degrades when a local resource, like memory, CPU, or disk, is close to, or at, full saturation. To detect a degradation, the process compares one or more local metrics, like available memory or remaining disk space, with fixed upper- and lower-bound thresholds. When a metric is above an upper-bound threshold, or below a lower-bound one, the process reports itself as unhealthy.
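As a sketch, a local health test boils down to comparing metrics against thresholds; the metric names and threshold values below are made up for illustration:

```python
def local_health_check(available_memory_mb, free_disk_mb,
                       min_memory_mb=256, min_disk_mb=1024):
    """Report unhealthy (503) when any local metric breaches its lower-bound
    threshold, healthy (200) otherwise. A real check would read the metrics
    from the operating system rather than take them as arguments."""
    if available_memory_mb < min_memory_mb or free_disk_mb < min_disk_mb:
        return 503  # tell the load balancer to take us out of the pool
    return 200
```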
A more advanced, and also harder check to get right, is the dependency health check. This type of health check detects a degradation caused by a remote dependency, like a database, that needs to be accessed to handle incoming requests. The process measures the response time, timeouts, and errors of the remote calls directed to the dependency. If any measure breaks a predefined threshold, the process reports itself as unhealthy to reduce the load on the downstream dependency.
But here be dragons: if the downstream dependency is temporarily unreachable, or the health check has a bug, then it’s possible that all the processes behind the load balancer fail the health check. In that case, a naive load balancer would just take all service instances out of rotation, bringing the entire service down!
A smart load balancer instead detects that a large fraction of the service instances is being reported as unhealthy and considers the health check to no longer be reliable. Rather than continuing to remove processes from the pool, it starts to ignore the health checks altogether so that new requests can be sent to any process in the pool.
One of the main reasons to build distributed services is to be able to withstand single-process failures. Since you are designing your system under the assumption that any process can crash at any time, your service needs to be able to deal with that eventuality.
For a process’s crash not to affect your service’s health, you should ideally ensure that:
Because crashes are inevitable and your service is prepared for them, you don’t have to come up with complex recovery logic when a process gets into some weird degraded state; you can just let it crash. A transient but rare failure can be hard to diagnose and fix. Crashing and restarting the affected process gives the operators maintaining the service some breathing room until the root cause can be identified, giving the system a kind of self-healing property.
Imagine that a latent memory leak causes the available memory to decrease over time. When a process has no more physical memory available, it starts to swap back and forth to the page file on disk. This swapping is extremely expensive and degrades the process’s performance dramatically. If left unchecked, the memory leak would eventually bring all processes running the service to their knees. Would you rather have the processes detect that they are degraded and restart themselves, or try to debug the root cause of the degradation at 3 AM?
To implement this pattern, a process should have a separate background thread, a watchdog, that wakes up periodically to monitor its health. For example, the watchdog could monitor the available physical memory left. When any monitored metric breaches a configured threshold, the watchdog considers the process degraded and deliberately restarts it.
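A watchdog can be sketched as below. The restart is modeled as a callback so the logic is testable; in practice it would exit the process and let a supervisor (the init system or an orchestrator) restart it. The metric, threshold, and interval are illustrative.

```python
import threading


class Watchdog:
    """Background thread that periodically checks a health metric against a
    threshold and deliberately restarts the process when it's breached."""

    def __init__(self, read_metric, threshold, on_degraded, interval=5.0):
        self.read_metric = read_metric  # e.g. available physical memory in MB
        self.threshold = threshold      # restart when the metric drops below this
        self.on_degraded = on_degraded  # e.g. sys.exit(1); a supervisor restarts us
        self.interval = interval
        self._stop = threading.Event()

    def check_once(self):
        # Returns False when the process was deemed degraded.
        if self.read_metric() < self.threshold:
            self.on_degraded()
            return False
        return True

    def run(self):
        while not self._stop.wait(self.interval):
            if not self.check_once():
                break

    def start(self):
        threading.Thread(target=self.run, daemon=True).start()
```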
The watchdog’s implementation needs to be well-tested and monitored since a bug could cause the processes to restart continuously.
When all resiliency mechanisms fail, human operators are the last line of defense. Historically, developers, testers, and operators were part of different teams. The developers handed over their software to a team of QA engineers responsible for testing it. When the software passed that stage, it moved to an operations team responsible for deploying it to production, monitoring it, and responding to alerts.
This model is being phased out in the industry as it has become commonplace for the development team to also be responsible for testing and operating the software they write. This forces the developers to embrace an end-to-end view of their applications, acknowledging that faults are inevitable and need to be accounted for.
Chapter 18 describes the different types of tests — unit, integration, and end-to-end tests — you can leverage to increase the confidence that your distributed applications work as expected.
Chapter 19 dives into continuous delivery and deployment pipelines used to release changes safely and efficiently to production.
Chapter 20 discusses how to use metrics and service-level indicators to monitor the health of distributed systems. It then describes how to define objectives that trigger alerts when breached. Finally, the chapter lists best practices for dashboard design.
Chapter 21 introduces the concept of observability and how it relates to monitoring. Then it describes how traces and logs can help developers debug their systems.
The longer it takes to detect a bug, the more expensive it becomes to fix it. Testing is all about catching bugs as early as possible. It allows developers to change the implementation with confidence that existing functionality won’t break, which increases the speed of refactorings, new features, and other changes. As a welcome side effect, testing also improves the system’s design, since developers have to put themselves in the users’ shoes to test it effectively. Tests also provide up-to-date documentation.
Unfortunately, because it’s impossible to predict all the ways a complex distributed application can fail, testing only provides best-effort guarantees that the code being tested is correct and fault-tolerant. No matter how exhaustive the test coverage is, tests can only cover failures developers can imagine, not the kind of complex emergent behavior that manifests itself only in production.
Although tests can’t give you complete confidence that your code is bug-free, they certainly do a good job at detecting failure scenarios you are aware of and validating expected behaviors. As a rule of thumb, if you want to be confident that your implementation behaves in a certain way, you have to add a test for it.
Tests come in different shapes and sizes. To begin with, we need to distinguish the code paths a test is actually testing (also known as the system under test, or SUT) from the ones that are merely being run. The SUT represents the scope of the test, and depending on it, the test can be categorized as either a unit test, an integration test, or an end-to-end test.
A unit test validates the behavior of a small part of the codebase, like an individual class. A good unit test should be relatively static in time and change only when the behavior of the SUT changes — refactoring, fixing a bug, or adding a new feature shouldn’t require a unit test to change. To achieve that, a unit test should:
An integration test has a larger scope than a unit test, since it verifies that a service can interact with its external dependencies as expected. This definition is not universal, though, because integration testing has different meanings for different people.
Martin Fowler区分了狭义集成测试和广义集成测试。狭义集成测试仅测试与外部依赖项(例如适配器及其支持类)通信的服务的代码路径。相比之下,广泛的集成测试会测试跨多个实时服务的代码路径。
Martin Fowler makes the distinction between narrow and broad integration tests. A narrow integration test exercises only the code paths of a service that communicate with an external dependency, like the adapters and their supporting classes. In contrast, a broad integration test exercises code paths across multiple live services.
在本章的其余部分中,我们将把这些更广泛的集成测试称为端到端测试。端到端测试验证系统中跨多个服务的行为,例如面向用户的场景。这些测试通常在共享环境中运行,例如预发布环境或生产环境。由于其范围较大,它们速度缓慢并且更容易出现间歇性故障。
In the rest of the chapter, we will refer to these broader integration tests as end-to-end tests. An end-to-end test validates behavior that spans multiple services in the system, like a user-facing scenario. These tests usually run in shared environments, like staging or production. Because of their scope, they are slow and more prone to intermittent failures.
端到端测试不应对共享同一环境的其他测试或用户产生任何影响。除此之外,这要求服务具有良好的故障隔离机制,例如速率限制,以防止有错误的测试影响系统的其余部分。
End-to-end tests should not have any impact on other tests or users sharing the same environment. Among other things, that requires services to have good fault isolation mechanisms, like rate-limiting, to prevent buggy tests from affecting the rest of the system.
端到端测试可能非常痛苦且维护成本高昂。例如,当端到端测试失败时,并不总是很明显哪个服务负责,需要进行更深入的调查。但它们是确保面向用户的场景在整个应用程序中按预期工作的必要之害。他们可以发现较小范围的测试无法发现的问题,例如意外的副作用和紧急行为。
End-to-end tests can be painful and expensive to maintain. For example, when an end-to-end test fails, it’s not always obvious which service is responsible and deeper investigation is required. But they are a necessary evil to ensure that user-facing scenarios work as expected across the entire application. They can uncover issues that tests with smaller scope can’t, like unanticipated side effects and emergent behaviors.
减少端到端测试数量的一种方法是将它们构建为用户旅程测试。用户旅程测试模拟用户与系统的多步骤交互(例如,对于电子商务服务:创建订单、修改订单、最后取消订单)。与将测试分成 N 个单独的端到端测试相比,此类测试通常需要更少的运行时间。
One way to minimize the number of end-to-end tests is to frame them as user journey tests. A user journey test simulates a multi-step interaction of a user with the system (e.g., for an e-commerce service: create an order, modify it, and finally cancel it). Such a test usually requires less time to run than splitting it into N separate end-to-end tests.
随着测试范围的扩大,它会变得更加脆弱、缓慢且成本高昂。间歇性失败的测试几乎与根本没有测试一样糟糕,因为开发人员不再对它们有任何信心,并最终忽略它们的失败。如果可能,首选范围较小的测试,因为它们往往更可靠、更快且更便宜。一个好的权衡是进行大量的单元测试、少量的集成测试以及更少的端到端测试(见图 18.1 )。
As the scope of a test increases, it becomes more brittle, slow, and costly. Intermittently failing tests are nearly as bad as no tests at all, as developers stop having any confidence in them and eventually ignore their failures. When possible, prefer tests with a smaller scope, as they tend to be more reliable, faster, and cheaper. A good trade-off is to have a large number of unit tests, a smaller fraction of integration tests, and even fewer end-to-end tests (see Figure 18.1).
图 18.1:测试金字塔
Figure 18.1: Test pyramid
测试的大小反映了运行需要多少计算资源,例如节点数量。一般来说,这取决于测试运行的环境的真实程度。尽管测试的范围和规模往往是相关的,但它们是不同的概念,有助于区分它们。
The size of a test reflects how much computing resources it needs to run, like the number of nodes. Generally, that depends on how realistic the environment is where the test runs. Although the scope and size of a test tend to be correlated, they are distinct concepts, and it helps to separate them.
小型测试在单个进程中运行,并且不执行任何阻塞调用或 I/O。它非常快、具有确定性,并且间歇性失败的可能性非常小。
A small test runs in a single process and doesn’t perform any blocking calls or I/O. It’s very fast, deterministic, and has a very small probability of failing intermittently.
中间测试在单个节点上运行并执行本地 I/O,例如从磁盘读取或对本地主机的网络调用。这带来了更多的延迟和不确定性,增加了间歇性故障的可能性。
An intermediate test runs on a single node and performs local I/O, like reads from disk or network calls to localhost. This introduces more room for delays and non-determinism, increasing the likelihood of intermittent failures.
大型测试需要运行多个节点,从而引入更多的不确定性和更长的延迟。
A large test requires multiple nodes to run, introducing even more non-determinism and longer delays.
毫不奇怪,测试规模越大,运行时间就越长,并且变得越不稳定。这就是为什么您应该为给定行为编写尽可能最小的测试。但是如何在不缩小测试范围的情况下缩小测试规模呢?
Unsurprisingly, the larger a test is, the longer it takes to run and the flakier it becomes. This is why you should write the smallest possible test for a given behavior. But how do you reduce the size of a test, while not reducing its scope?
您可以使用测试替身代替真正的依赖项来减少测试的大小,使其更快并且不易出现间歇性故障。有不同类型的测试替身:
You can use a test double in place of a real dependency to reduce the test’s size, making it faster and less prone to intermittent failures. There are different types of test doubles:
测试替身的问题在于,它们与真实实现的行为及其所有细微差别并不相似。相似度越低,您对使用测试替身的测试真正有用的信心就越低。因此,当真正的实现快速、确定性强并且依赖很少时,请直接使用它而不是测试替身。否则,您必须决定测试替身的真实程度,因为在其保真度和测试大小之间需要权衡。
The problem with test doubles is that they don’t resemble how the real implementation behaves with all its nuances. The smaller the resemblance, the less confidence you should have that a test using the double is actually useful. Therefore, when the real implementation is fast, deterministic, and has few dependencies, use it rather than a double. If that’s not the case, you have to decide how realistic you want the test double to be, as there is a trade-off between its fidelity and the test’s size.
当无法选择使用真正的实现时,请使用由依赖项的同一开发人员维护的假实现(如果可用)。存根或模拟是最后的选择,因为它们与实际实现最不相似,这使得使用它们的测试变得脆弱。
When using the real implementation is not an option, use a fake maintained by the same developers of the dependency, if one is available. Stubbing, or mocking, are last-resort options as they offer the least resemblance to the actual implementation, which makes tests that use them brittle.
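To make the distinction concrete, here is a sketch contrasting a fake with a stub for a hypothetical key-value store interface (all class and method names are illustrative, not from the book). The fake is a real, working in-memory implementation; the stub only returns canned answers:

```python
# Hypothetical key-value store doubles (illustrative, not from the book).

class InMemoryStore:
    """A fake: a lightweight but working implementation backed by a
    dict, so reads observe prior writes, just like the real store."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class StubStore:
    """A stub: hard-coded canned answers, no real behavior."""
    def __init__(self, canned_responses):
        self._canned = canned_responses

    def put(self, key, value):
        pass  # writes are silently dropped

    def get(self, key):
        return self._canned.get(key)

fake = InMemoryStore()
fake.put("user:1", "Alice")
assert fake.get("user:1") == "Alice"  # the write is observable

stub = StubStore({"user:1": "Bob"})
stub.put("user:1", "Alice")
assert stub.get("user:1") == "Bob"  # canned answer; the write was ignored
```

A test that exercises read-after-write behavior would pass against the fake (and the real store) but silently lie against the stub, which is why stubs are a last resort.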
对于集成测试,一个很好的折衷方案是在合同测试中使用模拟。合约测试定义了它打算发送到外部依赖项的请求以及它期望从外部依赖项接收的响应。然后测试使用该契约来模拟外部依赖项。例如,REST API 的合约由 HTTP 请求和响应对组成。为了确保契约不被破坏,外部依赖的测试套件使用相同的契约来模拟客户端并确保返回预期的响应。
For integration tests, a good compromise is to use mocking with contract tests. A contract test defines the request it intends to send to an external dependency and the response it expects to receive from it. This contract is then used by the test to mock the external dependency. For example, a contract for a REST API consists of an HTTP request and response pair. To ensure that the contract doesn’t break, the test suite of the external dependency uses the same contract to simulate a client and ensure that the expected response is returned.
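A minimal sketch of the idea, with a made-up REST contract (the paths, field names, and helper functions are hypothetical): the consumer's test suite builds a mock from the contract, while the provider's test suite replays the same contract against its handler:

```python
# A contract shared between the consumer's and the provider's test
# suites (paths, field names, and helpers are hypothetical examples).
CONTRACT = {
    "request": {"method": "GET", "path": "/billing/v1/accounts/42"},
    "response": {"status": 200,
                 "body": {"account_id": "42", "balance_cents": 1050}},
}

def mock_billing_api(method, path):
    """Consumer side: a mock generated from the contract."""
    expected = CONTRACT["request"]
    assert (method, path) == (expected["method"], expected["path"])
    return CONTRACT["response"]

def fetch_balance():
    """Consumer code under test, calling the mocked dependency."""
    response = mock_billing_api("GET", "/billing/v1/accounts/42")
    return response["body"]["balance_cents"]

assert fetch_balance() == 1050

def provider_handle(method, path):
    """Provider side: a stand-in for the real billing handler."""
    account_id = path.rsplit("/", 1)[-1]
    return {"status": 200,
            "body": {"account_id": account_id, "balance_cents": 1050}}

# The provider's test suite replays the contract's request and checks
# that the real handler still returns the agreed-upon response.
request = CONTRACT["request"]
actual = provider_handle(request["method"], request["path"])
assert actual == CONTRACT["response"]
```

Because both sides verify against the same contract, a provider change that breaks the consumer's expectations fails the provider's own test suite before it ships.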
与其他一切一样,测试需要做出权衡。
As with everything else, testing requires making tradeoffs.
假设我们想要测试服务提供的面向特定用户的 API 端点的行为。该服务与数据存储、另一个团队拥有的内部服务以及用于计费的第三方 API 进行通信(见图18.2)。如前所述,一般准则是在所需范围内编写尽可能小的测试。
Suppose we want to test the behavior of a specific user-facing API endpoint offered by a service. The service talks to a data store, an internal service owned by another team, and a third-party API used for billing (see Figure 18.2). As mentioned earlier, the general guideline is to write the smallest test possible with the desired scope.
图 18.2:您将如何测试该服务?
Figure 18.2: How would you test the service?
事实证明,端点不需要与内部服务通信,因此我们可以安全地使用模拟来代替它。数据存储带有内存中实现(假的),我们可以利用它来避免向远程数据存储发出网络调用。
As it turns out, the endpoint doesn’t need to communicate with the internal service, so we can safely use a mock in its place. The data store comes with an in-memory implementation (a fake) that we can leverage to avoid issuing network calls to a remote data store.
最后,我们不能直接使用第三方计费 API,因为那会要求测试发起真实的交易。幸运的是,该 API 提供了另一个端点,即一个沙盒环境,测试可以使用它而无需创建真实的交易。如果既没有可用的沙盒环境,也没有假实现,我们就不得不诉诸存根或模拟。
Finally, we can’t use the third-party billing API, as that would require the test to issue real transactions. Fortunately, the API has a different endpoint that offers a playground environment, which the test can use without creating real transactions. If there was no playground environment available and no fake either, we would have to resort to stubbing or mocking.
在这种情况下,我们大大缩减了测试的规模,同时保持其范围基本完整。
In this case, we have cut the test’s size considerably, while keeping its scope mostly intact.
这是一个更微妙的例子。假设我们需要测试在整个应用程序堆栈中清除属于特定用户的数据是否按预期工作。在欧洲,此功能是法律 (GDPR) 强制规定的,不遵守该功能可能会导致高达 2000 万欧元或 4% 年营业额的罚款,以较高者为准。在这种情况下,由于功能默默中断的风险太高,因此我们希望尽可能确信该功能按预期工作。这保证了使用在生产中运行的端到端测试并使用实时服务而不是测试替身。
Here is a more nuanced example. Suppose we need to test whether purging the data belonging to a specific user across the entire application stack works as expected. In Europe, this functionality is mandated by law (GDPR), and failing to comply with it can result in fines up to 20 million euros or 4% annual turnover, whichever is greater. In this case, because the risk for the functionality silently breaking is too high, we want to be as confident as possible that the functionality is working as expected. This warrants the use of an end-to-end test that runs in production and uses live services rather than test doubles.
Cindy Sridharan 在 https://copyconstruct.medium.com/testing-microservices-the-sane-way-9bb31d158c16 上撰写了有关该主题的精彩博客文章系列。
Cindy Sridharan wrote a great blog post series on the topic at https://copyconstruct.medium.com/testing-microservices-the-sane-way-9bb31d158c16.
一旦更改及其新引入的测试被合并到存储库中,就需要将其发布到生产环境中。
Once a change and its newly introduced tests have been merged to a repository, it needs to be released to production.
当发布更改需要手动过程时,这种情况不会经常发生。这意味着可能需要几天甚至几周的时间进行的一些更改最终会一起批量发布。这使得在部署失败时更难查明重大变更1,从而给整个团队带来干扰。发起发布的开发人员还需要通过监视仪表板和警报来密切关注它,以确保它按预期工作或回滚。
When releasing a change requires a manual process, it won’t happen frequently. This means that several changes, possibly accumulated over days or even weeks, end up being batched and released together, which makes it harder to pinpoint the breaking change¹ when a deployment fails, creating interruptions for the whole team. The developer who initiated the release also needs to keep an eye on it by monitoring dashboards and alerts to ensure that it’s working as expected, or roll it back otherwise.
手动部署非常浪费工程时间。当服务很多时,这个问题会进一步恶化。最终,安全有效地发布变更的唯一方法是自动化整个过程。一旦更改被合并到存储库,它应该自动安全地部署到生产环境。然后,开发人员可以自由地上下文切换到下一个任务,而不是引导部署。整个发布过程(包括回滚)可以通过持续交付和部署(CD)管道实现自动化。
Manual deployments are a terrible use of engineering time. The problem gets further exacerbated when there are many services. Eventually, the only way to release changes safely and efficiently is to automate the entire process. Once a change has been merged to a repository, it should automatically be rolled out to production safely. The developer is then free to context-switch to their next task, rather than shepherding the deployment. The whole release process, including rollbacks, can be automated with a continuous delivery and deployment (CD) pipeline.
CD 需要在保障、监控和自动化方面进行大量投资。如果检测到回归,则正在发布的工件(即包含更改的可部署组件)要么回滚到上一个版本,要么前进到下一个版本(假设它包含修补程序)。
CD requires a significant amount of investment in terms of safeguards, monitoring, and automation. If a regression is detected, the artifact being released — i.e., the deployable component that includes the change — is either rolled back to the previous version, or forward to the next one, assuming it contains a hotfix.
部署的安全性和发布生产变更所需的时间之间存在平衡。一个好的CD管道应该努力在两者之间做出良好的权衡。在本章中,我们将探讨如何进行。
There is a balance between the safety of a rollout and the time it takes to release a change to production. A good CD pipeline should strive to make a good trade-off between the two. In this chapter, we will explore how.
在较高层面上,代码更改需要经过四个阶段的管道才能发布到生产:审核、构建、预生产部署和生产部署。
At a high level, a code change needs to go through a pipeline of four stages to be released to production: review, build, pre-production rollout, and production rollout.
图 19.1:持续交付和部署管道阶段
Figure 19.1: Continuous delivery and deployment pipeline stages
这一切都始于开发人员向存储库提交供审核的拉取请求 (PR)。当 PR 提交审核时,需要对其进行编译、静态分析并通过一系列测试进行验证,所有这些都不会超过几分钟。为了提高测试速度并最大限度地减少间歇性故障,在此阶段运行的测试应该足够小,可以在单个进程或节点上运行,例如单元测试,而较大的测试仅在管道中稍后运行。
It all starts with a pull request (PR) submitted for review by a developer to a repository. When the PR is submitted for review, it needs to be compiled, statically analyzed, and validated with a battery of tests, all of which shouldn’t take longer than a few minutes. To increase the tests’ speed and minimize intermittent failures, the tests that run at this stage should be small enough to run on a single process or node (e.g., unit tests), with larger tests run only later in the pipeline.
PR 需要经过团队成员的审查和批准才能合并到存储库中。审核者必须验证更改是否正确且安全,以便 CD 管道自动发布到生产环境。检查表可以帮助审阅者不要忘记任何重要的事情:
The PR needs to be reviewed and approved by a team member before it can be merged into the repository. The reviewer has to validate whether the change is correct and safe to be released to production automatically by the CD pipeline. A checklist can help the reviewer not to forget anything important:
代码更改不应该是唯一经过此审核过程的更改。例如,云资源模板、静态资产、端到端测试和配置文件都应该在存储库中进行版本控制(不一定相同),并像代码一样对待。然后,同一服务可以有多个 CD 管道,每个存储库一个,可以并行运行。
Code changes shouldn’t be the only ones going through this review process. For example, cloud resource templates, static assets, end-to-end tests, and configuration files should all be version-controlled in a repository (not necessarily the same) and be treated just like code. The same service can then have multiple CD pipelines, one for each repository, that can potentially run in parallel.
使用 CD 管道检查和发布配置更改的重要性怎么强调都不为过。生产失败的最常见原因之一是在没有任何事先审查或测试的情况下全局应用配置更改。
I can’t stress enough the importance of reviewing and releasing configuration changes with a CD pipeline. One of the most common causes of production failures is a configuration change applied globally without any prior review or testing.
一旦更改被合并到存储库的主分支中,CD 管道就会进入构建阶段,在该阶段构建存储库的内容并将其打包到可部署的发布工件中。
Once the change has been merged into the repository’s main branch, the CD pipeline moves to the build stage, in which the repository’s content is built and packaged into a deployable release artifact.
在此阶段,工件被部署并发布到合成的预生产环境。尽管此环境缺乏生产的真实性,但验证是否未触发硬故障(例如,由于缺少配置设置而导致启动时出现空指针异常)以及端到端测试是否成功非常有用。由于将新版本发布到预生产阶段所需的时间比将其发布到生产阶段所需的时间要少得多,因此可以更早地检测到错误。
During this stage, the artifact is deployed and released to a synthetic pre-production environment. Although this environment lacks the realism of production, it’s useful to verify that no hard failures are triggered (e.g., a null pointer exception at startup due to a missing configuration setting) and that end-to-end tests succeed. Because releasing a new version to pre-production requires significantly less time than releasing it to production, bugs can be detected earlier.
您甚至可以拥有多个预生产环境,从为每个工件从头开始创建并用于运行简单的烟雾测试的环境,到类似于生产的持久环境,从它接收一小部分镜像请求。例如,AWS 使用多个预生产环境(Alpha、Beta 和 Gamma)。
You can even have multiple pre-production environments, starting with one created from scratch for each artifact and used to run simple smoke tests, to a persistent one similar to production that receives a small fraction of mirrored requests from it. AWS, for example, uses multiple pre-production environments (Alpha, Beta, and Gamma).
发布到预生产环境的服务应该调用其外部依赖的生产端点,以使环境尽可能稳定;不过,它可以调用同一团队拥有的其他服务的预生产端点。
A service released to a pre-production environment should call the production endpoints of its external dependencies to make the environment as stable as possible; it could call the pre-production endpoints of other services owned by the same team, though.
理想情况下,CD 管道应使用与生产中使用的相同的运行状况信号来评估预生产中工件的运行状况。预生产中使用的指标、警报和测试应与生产中使用的指标、警报和测试相同,以避免前者成为健康覆盖率低于标准的二等公民。
Ideally, the CD pipeline should assess the artifact’s health in pre-production using the same health signals used in production. Metrics, alerts, and tests used in pre-production should be equivalent to those used in production, to prevent the former from becoming a second-class citizen with sub-par health coverage.
一旦工件成功进入预生产阶段,CD 管道就可以进入最后阶段并将工件发布到生产环境。首先应该将其发布到少量生产实例2。目标是在尚未发现的问题有机会对生产造成广泛损害之前尽快发现它们。
Once an artifact has been rolled out to pre-production successfully, the CD pipeline can proceed to the final stage and release the artifact to production. It should start by releasing it to a small number of production instances². The goal is to surface problems that haven’t been detected so far as quickly as possible, before they have the chance to cause widespread damage in production.
如果进展顺利并且所有运行状况检查都通过,则该工件将逐步发布到队列的其余部分。在推出过程中,由于正在进行的部署,一部分机队无法提供任何流量,而其余实例则需要弥补这一不足。为了避免这种情况导致任何性能下降,需要留有足够的容量来维持增量释放。
If that goes well and all the health checks pass, the artifact is incrementally released to the rest of the fleet. While the rollout is in progress, a fraction of the fleet can’t serve any traffic due to the ongoing deployment, and the remaining instances need to pick up the slack. To avoid this causing any performance degradation, there needs to be enough capacity left to sustain the incremental release.
如果服务在多个区域可用,CD管道应首先从低流量区域开始,以减少错误发布的影响。其余地区的释放应分阶段进行,以进一步降低风险。当然,阶段越多,CD 管道将工件发布到生产环境所需的时间就越长。缓解此问题的一种方法是,一旦早期阶段成功完成并建立了足够的信心,就提高发布速度。例如,第一阶段可以将工件释放到单个区域,第二阶段可以释放到更大的区域,第三阶段可以同时释放到N个区域。
If the service is available in multiple regions, the CD pipeline should start with a low-traffic region to reduce the impact of a faulty release. The rollout to the remaining regions should be split into sequential stages to minimize risks further. Naturally, the more stages there are, the longer the CD pipeline takes to release the artifact to production. One way to mitigate this problem is to increase the release speed once the early stages complete successfully and enough confidence has been built up. For example, the first stage could release the artifact to a single region, the second to a larger region, and the third to N regions simultaneously.
在每个步骤之后,CD 管道需要评估部署的工件是否正常,否则停止发布并将其回滚。可以使用各种健康信号来做出该决定,例如:
After each step, the CD pipeline needs to assess whether the artifact deployed is healthy, or else stop the release and roll it back. A variety of health signals can be used to make that decision, such as:
仅监控正在推出的服务的运行状况信号是不够的。CD 管道还应监控上游和下游服务的运行状况,以检测推出的任何间接影响。管道应在一个步骤和下一步骤之间留出足够的时间(烘焙时间)以确保成功,因为某些问题只有在经过一段时间后才会出现。例如,性能下降仅在高峰时间才可见。
Monitoring just the health signals of the service being rolled out is not enough. The CD pipeline should also monitor the health of upstream and downstream services to detect any indirect impact of the rollout. The pipeline should allow enough time to pass between one step and the next (bake time) to ensure that it was successful, as some issues can appear only after some time has passed. For example, a performance degradation could be visible only at peak time.
CD 管道可以进一步控制特定 API 端点的请求数量的烘焙时间,以保证 API 表面得到正确的运用。为了加快释放速度,可以在每个步骤成功并建立信心后减少烘烤时间。
The CD pipeline can further gate the bake time on the number of requests seen for specific API endpoints to guarantee that the API surface has been properly exercised. To speed up the release, the bake time can be reduced after each step succeeds and confidence is built up.
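The staged rollout with bake times and health gating described in the last few paragraphs could be sketched as follows (stage names, bake times, and the callback signatures are made-up illustrations, not a real pipeline API):

```python
import time

# Illustrative staged rollout with bake times and health gating.
# Stage names, bake times, and the callback signatures are made up.
STAGES = [
    {"target": "low-traffic-region", "bake_s": 3600},
    {"target": "larger-region",      "bake_s": 1800},
    {"target": "remaining-regions",  "bake_s": 900},  # confidence grows
]

def rollout(artifact, deploy, healthy, rollback, sleep=time.sleep):
    """Deploy stage by stage; stop and roll back on a bad health signal."""
    for stage in STAGES:
        deploy(artifact, stage["target"])
        sleep(stage["bake_s"])  # bake time before judging the stage
        if not healthy(stage["target"]):
            rollback(artifact)  # or page the on-call engineer instead
            return False
    return True

# Demo run with no-op callbacks and the bake time skipped:
deployed = []
succeeded = rollout(
    "build-1234",
    deploy=lambda artifact, target: deployed.append(target),
    healthy=lambda target: True,
    rollback=lambda artifact: None,
    sleep=lambda seconds: None,
)
assert succeeded
assert deployed == ["low-traffic-region", "larger-region", "remaining-regions"]
```

Note how the bake time shrinks at each stage, mirroring the idea of speeding up the release as confidence builds.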
当运行状况信号报告性能下降时,CD 管道就会停止。此时,它可以自动回滚工件,或者触发警报以吸引待命工程师,后者需要决定是否需要回滚3。根据他们的输入,CD 管道会重试失败的阶段(例如,可能是因为当时有其他东西正在投入生产),或者完全回滚版本。操作员还可以停止管道并等待带有修补程序的新工件向前滚动。如果由于引入了向后不兼容的更改而无法回滚版本,则这可能是必要的。
When a health signal reports a degradation, the CD pipeline stops. At that point, it can either roll back the artifact automatically or trigger an alert to engage the engineer on-call, who needs to decide whether a rollback is warranted or not³. Based on their input, the CD pipeline retries the stage that failed (e.g., perhaps because something else was going into production at the time) or rolls back the release entirely. The operator can also stop the pipeline and wait for a new artifact with a hotfix to be rolled forward. This might be necessary if the release can’t be rolled back because a backward-incompatible change has been introduced.
由于前滚比回滚风险更大,因此根据经验,引入的任何更改都应始终向后兼容。向后不兼容的最常见原因是更改用于持久性或 IPC 目的的序列化格式。
Since rolling forward is much riskier than rolling back, any change introduced should always be backward compatible as a rule of thumb. The most common cause for backward-incompatibility is changing the serialization format used either for persistence or IPC purposes.
为了安全地引入向后不兼容的更改,需要将其分解为多个向后兼容的更改。例如,假设生产者和消费者服务之间的消息传递模式需要以向后不兼容的方式进行更改。在这种情况下,更改被分解为三个较小的更改,可以单独安全地回滚:
To safely introduce a backward-incompatible change, it needs to be broken down into multiple backward-compatible changes. For example, suppose the messaging schema between a producer and a consumer service needs to change in a backward-incompatible way. In this case, the change is broken down into three smaller changes that can individually be rolled back safely:
预生产中 CD 管道的自动升级-降级测试部分可用于验证更改是否实际上可以安全回滚。
An automated upgrade-downgrade test part of the CD pipeline in pre-production can be used to validate whether a change is actually safe to roll back or not.
监控主要用于检测影响生产中用户的故障,并触发发送给负责缓解故障的操作人员的通知(警报)。监控的另一个关键用例是通过仪表板提供系统运行状况的高级概述。
Monitoring is primarily used to detect failures that impact users in production and trigger notifications (alerts) sent to human operators responsible for mitigating them. The other critical use case for monitoring is to provide a high-level overview of the system’s health through dashboards.
早期,监控主要用作黑盒方法来报告服务是正常运行还是关闭,而无法清楚地了解内部发生的情况。多年来,随着开发人员开始检测其代码以发出应用程序级测量结果,以回答特定功能是否按预期工作,它已发展成为一种白盒方法。随着 Etsy引入statsd ,这种方法得到了普及,它规范了应用程序级别测量的收集。
In the early days, monitoring was used mostly as a black-box approach to report whether a service was up or down, without much visibility of what was going on inside. Over the years, it has evolved into a white-box approach as developers started to instrument their code to emit application-level measurements to answer whether specific features worked as expected. This was popularized with the introduction of statsd by Etsy, which normalized collecting application-level measurements.
如今,黑盒监控仍在使用来监控外部依赖项(例如第三方 API),并验证用户如何从外部感知服务的性能和运行状况。一种常见的方法是定期运行向外部 API 端点发送测试请求的脚本,并监控它们花费的时间以及是否成功。这些脚本部署在应用程序用户所在的同一区域,并到达相同的端点。因为它们从外部使用系统的公共表面,所以它们可以捕获应用程序内部不可见的问题,例如连接问题。这些脚本对于检测用户不经常使用的 API 问题也很有用。
Blackbox monitoring is still in use today to monitor external dependencies, such as third-party APIs, and to validate how users perceive the performance and health of a service from the outside. A common approach is to periodically run scripts that send test requests to external API endpoints and monitor how long they take and whether they succeed. These scripts are deployed in the same regions as the application’s users and hit the same endpoints they do. Because they exercise the system’s public surface from the outside, they can catch issues that aren’t visible from within the application, like connectivity problems. These scripts are also useful for detecting issues with APIs that aren’t exercised often by users.
黑盒监控擅长在出现问题时发现症状;相比之下,白盒监控可以帮助在用户受到影响之前识别已知硬故障模式的根本原因。根据经验,如果您无法设计消除硬故障模式,则应该为其添加监控。系统存在的时间越长,您就越能了解它如何发生故障以及需要监控哪些内容。
Blackbox monitoring is good at detecting the symptoms when something is broken; in contrast, white-box monitoring can help identify the root cause of known hard-failure modes before users are impacted. As a rule of thumb, if you can’t design away a hard-failure mode, you should add monitoring for it. The longer a system has been around, the better you will understand how it can fail and what needs to be monitored.
指标是在一定时间间隔内测量的信息的数字表示形式,并表示为时间序列,例如服务处理的请求数量。从概念上讲,指标是样本列表,其中每个样本由浮点数和时间戳表示。
A metric is a numeric representation of information measured over a time interval and represented as a time-series, like the number of requests handled by a service. Conceptually, a metric is a list of samples, where each sample is represented by a floating-point number and a timestamp.
现代监控系统允许使用一组称为标签的键值对来标记指标,这增加了指标的维度。本质上,每个不同的标签组合都是不同的指标。这已经成为一种必要,因为现代服务可以拥有与每个指标相关的大量元数据,例如数据中心、集群、节点、pod、服务等。高基数指标可以轻松地对数据进行切片和切块,并消除为每个标签组合手动创建指标的仪器成本。
Modern monitoring systems allow a metric to be tagged with a set of key-value pairs called labels, which increases the dimensionality of the metric. Essentially, every distinct combination of labels is a different metric. This has become a necessity as modern services can have a large amount of metadata associated with each metric, like datacenter, cluster, node, pod, service, etc. High-cardinality metrics make it easy to slice and dice the data, and eliminate the instrumentation cost of manually creating a metric for each label combination.
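A minimal sketch of a labeled counter (illustrative, not a real metrics client library) shows why every distinct label combination is effectively its own time series:

```python
from collections import defaultdict

# Minimal sketch of a labeled counter (illustrative, not a real
# metrics client): each distinct label combination gets its own series.
class Counter:
    def __init__(self, name):
        self.name = name
        self._series = defaultdict(int)  # label combination -> count

    def inc(self, amount=1, **labels):
        key = tuple(sorted(labels.items()))
        self._series[key] += amount

    def value(self, **labels):
        return self._series[tuple(sorted(labels.items()))]

failures = Counter("failure_count")
failures.inc(region="EastUs2", service="gateway")
failures.inc(region="EastUs2", service="gateway")
failures.inc(region="WestEu", service="gateway")

# Two label combinations, hence two separate time series:
assert failures.value(region="EastUs2", service="gateway") == 2
assert failures.value(region="WestEu", service="gateway") == 1
```

This is also why high-cardinality labels (e.g., user IDs) are dangerous in practice: the number of series grows with the number of distinct label combinations.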
服务应该发出有关其负载、内部状态以及下游服务依赖项的可用性和性能的指标。结合下游服务发出的指标,这使运营商能够快速识别问题。这需要显式的代码更改以及开发人员刻意地努力来检测他们的代码。
A service should emit metrics about its load, its internal state, and the availability and performance of its downstream service dependencies. Combined with the metrics emitted by the downstream services themselves, this allows operators to identify problems quickly. Doing so requires explicit code changes and a deliberate effort by developers to instrument their code.
例如,采用一个返回资源的虚构 HTTP 处理程序。一旦在生产环境中运行,您将希望能够回答一系列问题1:
For example, take a fictitious HTTP handler that returns a resource. There is a whole range of questions you will want to be able to answer once it’s running in production¹:
```python
def get_resource(self, id):
    resource = self._cache.get(id)  # in-process cache
    # Is the id valid?
    # Was there a cache hit?
    # How long has the resource been in the cache?
    if resource is not None:
        return resource
    resource = self._repository.get(id)
    # Did the remote call fail, and if so, why?
    # Did the remote call time out?
    # How long did the call take?
    self._cache[id] = resource
    # What's the size of the cache?
    return resource
    # How long did it take for the handler to run?
```

现在,假设我们想要记录我们的服务未能处理的请求数。实现此目的的一种方法是基于事件的方法:每当服务实例无法处理请求时,它就会在一个事件²中向本地遥测代理报告失败计数 1,例如:
Now, suppose we want to record the number of requests our service failed to handle. One way to do that is with an event-based approach: whenever a service instance fails to handle a request, it reports a failure count of 1 in an event² to a local telemetry agent, e.g.:
```json
{
    "failureCount": 1,
    "serviceRegion": "EastUs2",
    "timestamp": 1614438079
}
```

代理对这些事件进行批处理,并定期将它们发送到远程遥测服务,该服务将它们保存在事件日志的专用数据存储中。例如,Azure Monitor 基于日志的指标就采用了这种方法。
The agent batches these events and emits them periodically to a remote telemetry service, which persists them in a dedicated data store for event logs. For example, this is the approach taken by Azure Monitor’s log-based metrics.
正如您可以想象的那样,这是相当昂贵的,因为后端的负载随着摄取的事件数量的增加而增加。在查询时聚合事件的成本也很高——假设您想检索过去一个月北欧的故障数量;您必须发出一个查询,需要在该时间段内获取、过滤和聚合潜在的数万亿个事件。
As you can imagine, this is quite expensive since the load on the backend increases with the number of events ingested. Events are also costly to aggregate at query time — suppose you want to retrieve the number of failures in North Europe over the past month; you would have to issue a query that requires fetching, filtering, and aggregating potentially trillions of events within that time period.
有没有办法降低查询时的成本?由于指标是时间序列的,因此可以使用数学工具对它们进行建模和操作。时间序列的样本可以在预先指定的时间段(例如,1秒、5分钟、1小时等)内预先聚合,并用诸如总和、平均值或百分位数之类的汇总统计数据来表示。
Is there a way to reduce costs at query time? Because metrics are time-series, they can be modeled and manipulated with mathematical tools. The samples of a time-series can be pre-aggregated over pre-specified time periods (e.g., 1 second, 5 minutes, 1 hour, etc.) and represented with summary statistics such as the sum, average, or percentiles.
例如,遥测后端可以在摄取时预先聚合一个或多个时间段内的指标。从概念上讲,如果聚合(即我们示例中的总和)在一小时内发生,则每个serviceRegion都会有一个failureCount指标,每个指标每小时包含一个样本,例如:
For example, the telemetry backend can pre-aggregate metrics over one or more time periods at ingestion time. Conceptually, if the aggregation (i.e., the sum in our example) were to happen with a period of one hour, we would have one failureCount metric per serviceRegion, each containing one sample per hour, e.g.:
"00:00", 561,
"01:00", 42,
"02:00", 61,
..."00:00", 561,
"01:00", 42,
"02:00", 61,
...
后端可以创建多个不同周期的预聚合。然后在查询时,选择满足查询的最佳周期的预聚合指标。例如,CloudWatch(AWS 使用的遥测后端)会在获取数据时预先聚合数据。
The backend can create multiple pre-aggregates with different periods. Then at query time, the pre-aggregated metric with the best period that satisfies the query is chosen. For example, CloudWatch (the telemetry backend used by AWS) pre-aggregates data as it’s ingested.
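As a sketch of ingestion-time pre-aggregation (illustrative code, not CloudWatch's actual implementation), raw failure events like the ones above can be bucketed into per-region hourly sums:

```python
from collections import defaultdict

# Sketch of ingestion-time pre-aggregation (illustrative, not
# CloudWatch's actual implementation): bucket raw failure events
# into hourly sums, keyed by region and bucket start time.
def pre_aggregate(events, period_s=3600):
    buckets = defaultdict(int)
    for event in events:
        bucket_start = event["timestamp"] - event["timestamp"] % period_s
        buckets[(event["serviceRegion"], bucket_start)] += event["failureCount"]
    return dict(buckets)

events = [
    {"failureCount": 1, "serviceRegion": "EastUs2", "timestamp": 1614438079},
    {"failureCount": 1, "serviceRegion": "EastUs2", "timestamp": 1614438100},
    {"failureCount": 1, "serviceRegion": "EastUs2", "timestamp": 1614441700},
]
hourly = pre_aggregate(events)
assert hourly == {("EastUs2", 1614438000): 2, ("EastUs2", 1614441600): 1}
```

Three raw events collapse into two hourly samples; a query over a month then touches hundreds of samples instead of potentially trillions of events, but the original 5-minute resolution is gone for good.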
我们可以进一步推进这一想法,并通过让本地遥测代理在客户端预先聚合指标来降低摄取成本。
We can take this idea one step further and also reduce ingestion costs by having the local telemetry agents pre-aggregate metrics on the client side.
客户端和服务器端预聚合大大降低了指标的带宽、计算和存储要求。然而,这是有代价的;操作员在摄取指标后失去了重新聚合指标的灵活性,因为他们无法再访问生成这些指标的原始事件。例如,如果指标在 1 小时的时间内预先聚合,则在没有原始事件的情况下,以后无法在 5 分钟的时间内重新聚合该指标。
Client and server-side pre-aggregation drastically reduces bandwidth, compute, and storage requirements for metrics. However, it comes at a cost; operators lose the flexibility to re-aggregate metrics after they have been ingested, as they no longer have access to the original events that generated them. For example, if a metric is pre-aggregated over a period of time of 1 hour, it can’t later be re-aggregated over a period of 5 min without the original events.
由于指标主要用于警报和可视化目的,因此它们通常以预聚合的形式保留在时间序列数据存储中,因为查询预聚合数据的效率可能比其他方法高几个数量级。
Because metrics are mainly used for alerting and visualization purposes, they are usually persisted in pre-aggregated form in a time-series data store, since querying pre-aggregated data can be several orders of magnitude more efficient than the alternative.
如前所述,指标的主要用例之一是警报。这并不意味着我们应该为每一个可能的指标创建警报——例如,在半夜收到警报是没有用的,因为服务在几分钟前内存消耗出现了大幅峰值。在本节中,我们将讨论一个非常适合警报的特定指标类别。
As mentioned earlier, one of the main use cases for metrics is alerting. That doesn’t mean we should create alerts for every possible metric out there; for example, it’s useless to be alerted in the middle of the night because a service had a big spike in memory consumption a few minutes earlier. In this section, we will discuss one specific metric category that lends itself well to alerting.
服务级别指标(SLI) 是衡量服务向用户提供的服务级别的一个方面的指标,例如响应时间、错误率或吞吐量。SLI 通常在滚动时间窗口内聚合,并用汇总统计数据(例如平均值或百分位)表示。
A service-level indicator (SLI) is a metric that measures one aspect of the level of service provided by a service to its users, like the response time, error rate, or throughput. SLIs are typically aggregated over a rolling time window and represented with a summary statistic, like average or percentile.
SLI 最好用两个指标的比率来定义,即良好事件与事件总数的比率,因为它们很容易解释:0 表示服务已损坏,1 表示一切都按预期运行(参见图 20.1 )。正如我们将在本章后面看到的,比率还简化了警报的配置。
SLIs are best defined as a ratio of two metrics, good events over the total number of events, since such ratios are easy to interpret: 0 means the service is broken, and 1 means everything is working as expected (see Figure 20.1). As we will see later in the chapter, ratios also simplify the configuration of alerts.
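For instance, an availability SLI could be computed as the ratio of successful responses over all responses (a sketch; treating HTTP 5xx status codes as "bad" events is an assumption for illustration):

```python
# Sketch: an availability SLI as the ratio of good events over total
# events. Treating HTTP 5xx as "bad" is an assumption for illustration.
def availability_sli(status_codes):
    total = len(status_codes)
    if total == 0:
        return 1.0  # no traffic, so nothing violated the indicator
    good = sum(1 for status in status_codes if status < 500)
    return good / total

statuses = [200, 200, 503, 200, 200, 200, 200, 200, 500, 200]
assert availability_sli(statuses) == 0.8  # 8 good events out of 10
```

The result lands on the same 0-to-1 scale described above, which makes it straightforward to alert on a threshold such as "below 0.999 over a rolling window".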
以下是一些常用的服务 SLI:
These are some commonly used SLIs for services:
图 20.1:SLI 定义为良好事件与事件总数的比率。
Figure 20.1: An SLI defined as the ratio of good events over the total number of events.
一旦决定了测量内容,您就需要决定在哪里测量。以响应时间为例。您应该使用服务、负载均衡器或客户端报告的指标吗?一般来说,您希望使用最能代表用户体验的一种。如果收集成本太高,请选择下一个最佳候选人。在上一个示例中,客户端指标更有意义,因为它考虑了整个请求路径中的延迟。
Once you have decided what to measure, you need to decide where to measure it. Take the response time, for example. Should you use the metric reported by the service, load balancer, or clients? In general, you want to use the one that best represents the experience of the users. And if that’s too costly to collect, pick the next best candidate. In the previous example, the client metric is the more meaningful one, as that accounts for delays in the entire path of the request.
现在,您应该如何衡量响应时间?测量结果可能会受到各种因素的影响,这些因素会增加其方差,例如网络超时、页面错误或频繁的上下文切换。由于每个请求所花费的时间并不相同,因此响应时间最好用分布来表示,该分布往往是右偏和长尾的。
Now, how should you measure response times? Measurements can be affected by various factors that increase their variance, such as network timeouts, page faults, or heavy context switching. Since not every request takes the same amount of time, response times are best represented with a distribution, which tends to be right-skewed and long-tailed.
分布可以用统计量来概括。以平均值为例。虽然它有其用途,但它并不能告诉您太多有关经历特定响应时间的请求的比例。只需要一个极端的异常值就会扭曲平均值。例如,如果有 100 个请求访问您的服务,其中 99 个请求的响应时间为 1 秒,其中一个请求的响应时间为 10 分钟,则平均值接近 7 秒。尽管 99% 的请求的响应时间为 1 秒,但平均值却比该时间高出 7 倍。
A distribution can be summarized with a statistic. Take the average, for example. While it has its uses, it doesn’t tell you much about the proportion of requests experiencing a specific response time. All it takes is one extreme outlier to skew the average. For example, if 100 requests hit your service, 99 of which have a response time of 1 second and one a response time of 10 minutes, the average is nearly 7 seconds. Even though 99% of the requests experience a response time of 1 second, the average is 7 times higher than that.
A better way to represent the distribution of response times is with percentiles. A percentile is the value below which a percentage of the response times fall. For example, if the 99th percentile is 1 second, then 99% of requests have a response time below or equal to 1 second. The upper percentiles of a response time distribution, like the 99th and 99.9th percentiles, are also called long-tail latencies. In general, the higher the variance of a distribution, the more likely the average user is to be affected by long-tail behavior3.
Even though only a small fraction of requests experience these extreme latencies, it impacts your most profitable users. They are the ones that make the highest number of requests and thus have a higher chance of experiencing tail latencies. Several studies have shown that high latencies can negatively affect revenues. A mere 100-millisecond delay in load time can hurt conversion rates by 7 percent.
Also, long-tail behaviors left unchecked can quickly bring a service to its knees. Suppose a service is using 2K threads to serve 10K requests per second. By Little’s Law, the average response time of a thread is 200 ms. Suddenly, a network switch becomes congested, and as it happens, 1% of requests are being served from a node behind that switch. That 1% of requests, or 100 requests per second out of the 10K, starts taking 20 seconds to complete.
How many more threads does the service need to deal with the small fraction of requests having a high response time? If 100 requests per second take 20 seconds to process, then 2K additional threads are needed to deal just with the slow requests. So the number of threads used by the service needs to double to keep up with the load!
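The arithmetic can be checked with Little's Law (concurrency = throughput × average response time). A quick sketch, using the numbers from the example above:

```python
# Little's Law: threads in use L = throughput (req/s) * avg latency (s).
def threads_needed(requests_per_second, avg_response_time_s):
    return requests_per_second * avg_response_time_s

# Steady state from the text: 10K req/s at 200 ms needs 2K threads.
baseline = threads_needed(10_000, 0.2)  # 2000 threads

# After the congested switch: 1% of traffic takes 20 s.
slow = threads_needed(100, 20.0)        # 2000 threads just for the slow requests
fast = threads_needed(9_900, 0.2)       # 1980 threads for the rest
total = slow + fast                     # ~4000: the thread count roughly doubles
```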
Measuring long-tail behavior and keeping it under check doesn’t just make your users happy, but also drastically improves the resiliency of your service and reduces operational costs. When you are forced to guard against the worst-case long-tail behavior, you happen to improve the average case as well.
A service-level objective (SLO) defines a range of acceptable values for an SLI within which the service is considered to be in a healthy state (see Figure 20.2). An SLO sets the expectation for its users of how the service should behave when it's functioning correctly. Service owners can also use SLOs to define a service-level agreement (SLA) with their users — a contractual agreement that dictates what happens when an SLO isn't met, typically resulting in financial consequences.
For example, an SLO could define that 99% of API calls to endpoint X should complete below 200 ms, as measured over a rolling window of 1 week. Another way to look at it is that it's acceptable for up to 1% of requests within a rolling week to have a latency higher than 200 ms. That 1% is also called the error budget, which represents the number of failures that can be tolerated.
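The error-budget bookkeeping can be sketched in a few lines of Python (the helper names are illustrative, not from any particular monitoring tool):

```python
# Error budget for an SLO: a 99% target tolerates failures in 1% of events.
def error_budget(slo_target, total_requests):
    """Number of failures tolerated over the SLO window."""
    return (1 - slo_target) * total_requests

def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if blown)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed_requests) / budget

# Hypothetical numbers: 1M requests over the rolling week at a 99% SLO
# tolerate 10K failures; 2.5K failures leave 75% of the budget unspent.
budget = error_budget(0.99, 1_000_000)
remaining = budget_remaining(0.99, 1_000_000, 2_500)
```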
Figure 20.2: An SLO defines the range of acceptable values for an SLI.
SLOs are helpful for alerting purposes and help the team prioritize repair tasks against feature work. For example, the team can agree that when an error budget has been exhausted, repair items will take precedence over new features until the SLO is restored. Also, an incident's importance can be measured by how much of the error budget has been burned: an incident that burned 20% of the error budget warrants more scrutiny than one that burned only 1%.
Smaller time windows force the team to act quicker and prioritize bug fixes and repair items, while longer windows are better suited to make long-term decisions about which projects to invest in. Therefore it makes sense to have multiple SLOs with different window sizes.
How strict should SLOs be? Choosing the right target range is harder than it looks. If it’s too loose, you won’t detect user-facing issues; if it’s too strict, you will waste engineering time micro-optimizing and get diminishing returns. Even if you could guarantee 100% reliability for your system, you can’t make guarantees for anything that your users depend on to access your service that is outside your control, like their last-mile connection. Thus, 100% reliability doesn’t translate into a 100% reliable experience for users.
When setting the target range for your SLOs, start with comfortable ranges and tighten them as you build up confidence. Don’t just pick targets that your service meets today that might become unattainable in a year after the load increases; work backward from what users care about. In general, anything above 3 nines of availability is very costly to achieve and provides diminishing returns.
How many SLOs should you have? You should strive to keep things simple and have as few as possible that provide a good enough indication of the desired service level. SLOs should also be documented and reviewed periodically. For example, suppose you discover that a specific user-facing issue generated lots of support tickets, but none of your SLOs showed any degradations. In that case, they are either too relaxed, or you are not measuring something that you should.
SLOs need to be agreed on with multiple stakeholders. Engineers need to agree that the targets are achievable without excessive toil. If the error budget is burning too rapidly or has been exhausted, repair items will take priority over features. Product managers have to agree that the targets guarantee a good user experience. As Google’s SRE book mentions: “if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.”
Users can become over-reliant on the actual behavior of your service rather than the published SLO. To mitigate that, you can consider injecting controlled failures in production — also known as chaos testing — to “shake the tree” and ensure the dependencies can cope with the targeted service level and are not making unrealistic assumptions. As an added benefit, injecting faults helps validate that resiliency mechanisms work as expected.
Alerting is the part of a monitoring system that triggers an action when a specific condition happens, like a metric crossing a threshold. Depending on the severity and the type of the alert, the action triggered can range from running some automation, like restarting a service instance, to ringing the phone of a human operator who is on-call. In the rest of this section, we will be mostly focusing on the latter case.
For an alert to be useful, it has to be actionable. The operator shouldn’t spend time digging into dashboards to assess the alert’s impact and urgency. For example, an alert signaling a spike in CPU usage is not useful as it’s not clear whether it has any impact on the system without further investigation. On the other hand, an SLO is a good candidate for an alert because it quantifies its impact on the users. The SLO’s error budget can be monitored to trigger an alert whenever a large fraction of it has been consumed.
Before we can discuss how to define an alert, it’s important to understand that there is a trade-off between its precision and recall. Formally, precision is the fraction of significant events over the total number of alerts, while recall is the ratio of significant events that triggered an alert. Alerts with low precision are noisy and often not actionable, while alerts with low recall don’t always trigger during an outage. Although it would be nice to have 100% precision and recall, you have to make a trade-off since improving one typically lowers the other.
Suppose you have an availability SLO of 99% over 30 days, and you would like to configure an alert for it. A naive way would be to trigger an alert whenever the availability goes below 99% within a relatively short time window, like an hour. But how much of the error budget has actually been burned by the time the alert triggers?
Because the time window of the alert is one hour, and the SLO error budget is defined over 30 days, the percentage of the error budget that has been spent by the time the alert triggers is 1 hour / (30 days × 24 hours) = 1/720 ≈ 0.14%. Is it really critical to be notified that 0.14% of the SLO's error budget has been burned? Probably not. In this case, you have high recall but low precision.
You can improve the alert's precision by increasing the amount of time its condition needs to be true. The problem is that the alert now takes longer to trigger, which is an issue during an actual outage. The alternative is to alert based on how fast the error budget is burning, also known as the burn rate, which lowers the detection time.
The burn rate is defined as the percentage of the error budget consumed over the percentage of the SLO time window that has elapsed — in other words, the rate at which the error budget is being consumed. Concretely, for our SLO example, a burn rate of 1 means the error budget will be exhausted in exactly 30 days; if the rate is 2, in 15 days; if the rate is 3, in 10 days, and so on.
By rearranging the burn rate's equation, you can derive the alert threshold that triggers when a specific percentage of the error budget has been burned. For example, for an alert to trigger when 2% of the error budget has been burned within a one-hour window, the threshold for the burn rate should be set to 2% / (1 hour / 720 hours) = 0.02 × 720 = 14.4.
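The same derivation as a short Python sketch (the helper is illustrative, not a monitoring product's API):

```python
# Burn rate = fraction of error budget consumed /
#             fraction of the SLO window elapsed.
def burn_rate_threshold(budget_fraction_burned, alert_window_hours,
                        slo_window_days=30):
    """Burn rate at which the given budget fraction is consumed
    within the alert window."""
    slo_window_hours = slo_window_days * 24       # 720 h for a 30-day SLO
    window_fraction = alert_window_hours / slo_window_hours
    return budget_fraction_burned / window_fraction

# Alert when 2% of the budget burns within one hour: threshold 14.4.
threshold = burn_rate_threshold(0.02, 1)
```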
To improve recall, you can have multiple alerts with different thresholds. For example, a burn rate below 2 could be a low-severity alert that sends an e-mail and is investigated during working hours. The SRE workbook has some great examples of how to configure alerts based on burn rates.
While you should define most of your alerts based on SLOs, some should trigger for known hard-failure modes that you haven't had the time to design or debug away. For example, suppose you know your service suffers from a memory leak that has led to an incident in the past, but you haven't yet managed to track down the root cause or build a resiliency mechanism to mitigate it. In this case, it could be useful to define an alert that triggers an automated restart when a service instance is running out of memory.
After alerting, the other main use case for metrics is to power real-time dashboards that display the overall health of a system.
Unfortunately, dashboards can easily become a dumping ground for charts that end up being forgotten, have questionable usefulness, or are just plain confusing. Good dashboards don’t happen by coincidence. In this section, I will present some best practices on how to create useful dashboards.
The first thing to decide when creating a dashboard is who the audience is and what they are looking for. Given the audience, you can work backward to decide which charts, and therefore which metrics, to include.
The categories of dashboards presented here (see Figure 20.3) are by no means standard but should give you an idea of how to organize dashboards.
Figure 20.3: Dashboards should be tailored to their audience.
SLO dashboard
The SLO summary dashboard is designed to be used by various stakeholders from across the organization to gain visibility into the system’s health as represented by its SLOs. During an incident, this dashboard quantifies the impact it’s having on users.
Public API dashboard
This dashboard displays metrics about the system's public API endpoints, which helps operators identify problematic paths during an incident. For each endpoint, the dashboard exposes several metrics related to request messages, request handling, and response messages, like:
Service dashboard
A service dashboard displays service-specific implementation details, which require a deep understanding of its inner workings. Unlike the previous dashboards, this one is primarily used by the team that owns the service.
Beyond service-specific metrics, a service dashboard should also contain metrics for upstream dependencies like load balancers and messaging queues, and downstream dependencies like data stores.
This dashboard offers a first entry point into the behavior of a service when debugging. As we will later learn when discussing observability, this high-level view is just the starting point. The operator typically drills down into the metrics by segmenting them further, and eventually reaches for raw logs and traces to get more detail.
As new metrics are added and old ones removed, charts and dashboards need to be modified and kept in sync across multiple environments like staging and production. The most effective way to achieve that is to define dashboards and charts with a domain-specific language and version-control them just like code. This allows dashboards to be updated from the same pull request that contains the related code changes, without error-prone manual edits.
As dashboards render top to bottom, the most important charts should always be located at the very top.
Charts should be rendered with a default timezone, like UTC, to ease the communication between people located in different parts of the world when looking at the same data.
Similarly, all charts in the same dashboard should use the same time resolution (e.g., 1 min, 5 min, 1 hour, etc.) and range (24 hours, 7 days, etc.). This makes it easy to correlate anomalies across charts in the same dashboard visually. You should pick the default time range and resolution based on the most common use case for a dashboard. For example, a 1-hour range with a 1-min resolution is best to monitor an ongoing incident, while a 1-year range with a 1-day resolution is best for capacity planning.
You should keep the number of data points and metrics on the same chart to a minimum. Rendering too many points doesn't just slow down loading a chart; it also makes it harder to interpret and to spot anomalies in.
A chart should contain only metrics with similar ranges (min and max values); otherwise, the metric with the largest range can completely hide the others with smaller ranges. For that reason, it makes sense to split related statistics for the same metric into multiple charts. For example, the 10th percentile, average and 90th percentile of a metric can be displayed in one chart, while the 0.1th percentile, 99.9th percentile, minimum and maximum in another.
A chart should also contain useful annotations, like:
Metrics that are only emitted when an error condition occurs can be hard to interpret, as charts will show wide gaps between the data points, leaving the operator wondering whether the service stopped emitting that metric due to a bug. To avoid this, emit the metric with a value of 0 in the absence of an error and a value of 1 in its presence.
A healthy on-call rotation is only possible when services are built from the ground up with reliability and operability in mind. By making the developers responsible for operating what they build, they are incentivized to reduce the operational toil to a minimum. They are also in the best position to be on-call since they are intimately familiar with the system's architecture, brick walls, and trade-offs.
Being on-call can be very stressful. Even when there are no call-outs, just the thought of not having the same freedom usually enjoyed outside of regular working hours can cause anxiety. This is why being on-call should be compensated, and there shouldn’t be any expectations for the on-call engineer to make any progress on feature work. Since they will be interrupted by alerts, they should make the most out of it and be given free rein to improve the on-call experience, for example, by revising dashboards or improving resiliency mechanisms.
Achieving a healthy on-call is only possible when alerts are actionable. When an alert triggers, it should, at the very least, link to relevant dashboards and a run-book that lists the actions the engineer should take, as it's all too easy to miss a step when you get a call in the middle of the night4. Unless the alert was a false positive, all actions taken by the operator should be communicated in a shared channel, like a global chat, that's accessible to other teams. This allows others to chime in, track the incident's progress, and makes it easier to hand over an ongoing incident to someone else.
The first step to address an alert is to mitigate it, not fix the underlying root cause that created it. A new artifact has been rolled out that degrades the service? Roll it back. The service can’t cope with the load even though it hasn’t increased? Scale it out.
Once the incident has been mitigated, the next step is to brainstorm ways to prevent it from happening again. The more widespread the impact was, the more time you should spend on this. Incidents that burned a significant fraction of an SLO’s error budget require a postmortem.
A postmortem’s goal is to understand an incident’s root cause and come up with a set of repair items that will prevent it from happening again. There should also be an agreement in the team that if an SLO’s error budget is burned or the number of alerts spirals out of control, the whole team stops working on new features to focus exclusively on reliability until a healthy on-call rotation has been restored.
The SRE books provide a wealth of information and best practices regarding setting up a healthy on-call rotation.
I have omitted error handling for simplicity↩︎
We will talk more about event logs in section 21.1, for now assume an event is just a dictionary.↩︎
This tends to be primarily caused by various queues in the request-response path.↩︎
For the same reason, you should automate what you can to minimize manual actions that operators need to perform. Machines are good at following instructions; use that to your advantage.↩︎
A distributed system is never 100% healthy at any given time as there can always be something failing. A whole range of failure modes can be tolerated, thanks to relaxed consistency models and resiliency mechanisms like rate limiting, retries, and circuit breakers. Unfortunately, they also increase the system’s complexity. And with more complexity, it becomes increasingly harder to reason about the multitude of emergent behaviours the system might experience.
As we have discussed, human operators are still a fundamental part of operating a service, as there are things that can't be automated, like mitigating an incident. Debugging is another such task. When a system is designed to tolerate some level of degradation and to self-heal, it's neither necessary nor possible to monitor every way it can get into an unhealthy state. You still need tooling and instrumentation to debug complex emergent failures, because they are impossible to predict up-front.
When debugging, the operator makes a hypothesis and tries to validate it. For example, the operator might get suspicious after noticing that the variance of her service's response time has increased slowly but steadily over the past weeks, indicating that some requests take much longer than others. After correlating the increase in variance with an increase in traffic, the operator hypothesizes that the service is getting closer to hitting a constraint, like a limit or resource contention. Metrics and charts alone won't help to validate this hypothesis.
Observability is a set of tools that provide granular insights into a system in production, allowing us to understand its emergent behaviours. A good observability platform strives to minimize the time it takes to validate hypotheses. This requires granular events with rich contexts, since it’s not possible to know up-front what’s going to be useful in the future.
At the core of observability, we find telemetry sources like metrics, event logs, and traces. Metrics are stored in time-series data stores that have high throughput, but struggle to deal with metrics that have many dimensions. Conversely, event logs and traces end up in transactional stores that can handle high-dimensional data well, but struggle with high throughput. Metrics are mainly used for monitoring, while event logs and traces mainly for debugging.
Observability is a superset of monitoring. While monitoring is focused exclusively on tracking the health of a system, observability also provides tools to understand and debug it. Monitoring on its own is good at detecting failure symptoms, but less so to explain their root cause (see Figure 21.1).
Figure 21.1: Observability is a superset of monitoring.
A log is an immutable list of time-stamped events that happened over time. An event can have different formats. In its simplest form, it’s just free-form text. It can also be structured and represented with a textual format like JSON, or a binary one like Protobuf. When structured, an event is typically represented with a bag of key-value pairs:
{
  "failureCount": 1,
  "serviceRegion": "EastUs2",
  "timestamp": 1614438079
}
Logs can originate from your services and external dependencies, like message brokers, proxies, databases, etc. Most languages offer libraries that make it easy to emit structured logs. Logs are typically dumped to disk files, which are rotated every so often, and forwarded by an agent to an external log collector asynchronously, like an ELK stack or AWS CloudWatch logs.
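As a rough sketch, a structured event like the one above can be emitted as a JSON line with nothing more than Python's standard library. In practice you would likely use a dedicated structured-logging library with an asynchronous sink, so treat this as illustrative:

```python
import json
import sys
import time

def log_event(stream=sys.stdout, **fields):
    """Emit one structured event as a single JSON line and return it."""
    event = {"timestamp": int(time.time()), **fields}
    stream.write(json.dumps(event, sort_keys=True) + "\n")
    return event

# Emits e.g. {"failureCount": 1, "serviceRegion": "EastUs2", "timestamp": ...}
event = log_event(failureCount=1, serviceRegion="EastUs2")
```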
Logs provide a wealth of information about everything that's happening in a service. They are particularly helpful for debugging purposes, as they allow us to trace back from a symptom, like a service instance crash, to the root cause. They also help investigate long-tail behaviors that are invisible to metrics summarized with averages and percentiles, which can't explain why a specific user request is failing.
Logs are very simple to emit, particularly free-form textual ones. But that's pretty much the only advantage they have over metrics and other instrumentation tools. Logging libraries can add overhead to your services if misused, especially when they are not asynchronous and block while writing to stdout or disk. Also, if the disk fills up due to excessive logging, the service instance might get itself into a degraded state. At best, you lose logging; at worst, the service instance stops working if it requires disk access to handle requests.
Ingesting, processing, and storing a massive trove of data is not cheap either, no matter whether you plan to do this in-house or use a third-party service. Although structured binary logs are more efficient than textual ones, they are still expensive due to their high dimensionality.
Finally, and no less important, logs have a low signal-to-noise ratio because they are fine-grained and service-specific, which makes it challenging to extract useful information from them.
Best Practices
To make the job of the engineer drilling into the logs less painful, all the data about a specific work unit should be stored in a single event. A work unit typically corresponds to a request or a message pulled from a queue. To effectively implement this pattern, code paths handling work units need to pass around a context object containing the event being built.
An event should contain useful information about the work unit, like who created it, what it was for, and whether it succeeded or failed. It should include measurements as well, like how long specific operations took. Every network call performed within the work unit needs to be instrumented to log its response status code and response time. Finally, data logged to the event should be sanitized and stripped of potentially sensitive properties that developers shouldn’t have access to, like user content.
Collating all data within a single event for a work unit minimizes the need for joins but doesn’t completely eliminate it. For example, if a service calls another downstream, you will have to perform a join to correlate the caller’s event log with the callee’s one to understand why the remote call failed. To make that possible, every event should include the id of the request or message for the work unit.
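The context-object pattern might look like the following sketch. The class and method names are hypothetical, not from any particular library:

```python
import time

class WorkUnitEvent:
    """Accumulates everything about one work unit into a single event."""
    def __init__(self, request_id):
        self.fields = {"request_id": request_id}

    def set(self, key, value):
        self.fields[key] = value

    def time_operation(self, name, operation):
        """Run an operation and record how long it took in the event."""
        start = time.monotonic()
        result = operation()
        self.fields[f"{name}_duration_s"] = time.monotonic() - start
        return result

# Code paths handling the work unit pass the event around and enrich
# it; the event is emitted once, when the work unit completes.
event = WorkUnitEvent(request_id="req-42")
event.set("user_tier", "paid")
total = event.time_operation("sum", lambda: sum(range(1000)))
event.set("succeeded", True)
```

Including the request id in every event is what later makes it possible to join the caller's event with the callee's.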
Costs
There are various ways to keep the costs of logging under control. A simple approach is to have different logging levels (e.g., debug, info, warning, error) controlled by a dynamic knob that determines which ones are emitted. This allows operators to increase the logging verbosity for investigation purposes and to reduce costs when granular logs aren't needed.
Sampling is another option for reducing verbosity. For example, a service could log only every n-th event. Additionally, events can be prioritized based on their expected signal-to-noise ratio; for example, failed requests should be sampled at a higher frequency than successful ones.
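A sampling decision along these lines can be sketched as follows (the rates and outcome labels are illustrative):

```python
import random

# Keep every failure but only ~1% of successes: failures carry far
# more signal per logged event. Rates here are illustrative.
SAMPLE_RATES = {"success": 0.01, "failure": 1.0}

def should_log(outcome, rng=random.random):
    """Decide whether to emit this event, given its outcome."""
    return rng() < SAMPLE_RATES[outcome]

always_kept = should_log("failure")               # True: random() < 1.0 always holds
kept = should_log("success", rng=lambda: 0.005)   # True: 0.005 < 0.01
dropped = should_log("success", rng=lambda: 0.5)  # False: 0.5 >= 0.01
```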
The options discussed so far only reduce the logging verbosity on a single node. As you scale out and add more nodes, the logging volume will necessarily increase. Even with the best intentions, someone could check in a bug that leads to excessive logging. To avoid costs soaring through the roof or killing your logging pipeline entirely, log collectors need to be able to rate-limit requests. If you use a third-party service to ingest, store, and query your logs, there probably is a quota in place already.
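One way a collector could enforce such a limit is with a token bucket. The sketch below is illustrative rather than how any particular logging service implements its quota.

```python
import time

class TokenBucket:
    """Minimal token bucket a log collector could use to rate-limit ingestion."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        """Return True if n events may be ingested now, False if over quota."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # over quota: drop or reject the batch
```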
Of course, you can always opt to create in-memory aggregates (e.g., metrics) from the measurements collected in events and emit just those rather than raw logs. By doing so, you trade away the ability to drill down into the aggregates if needed.
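A sketch of that trade-off, assuming hypothetical event fields: measurements are rolled up in memory per (operation, status) pair, and only the summaries are emitted on flush.

```python
from collections import defaultdict

class MetricAggregator:
    """Aggregate measurements in memory and emit summaries instead of raw events."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.durations = defaultdict(list)

    def record(self, event):
        key = (event["operation"], "ok" if event["succeeded"] else "error")
        self.counts[key] += 1
        self.durations[key].append(event["duration_ms"])

    def flush(self):
        """Emit one summary data point per (operation, status) and reset."""
        summary = {
            key: {"count": n, "avg_ms": sum(self.durations[key]) / n}
            for key, n in self.counts.items()
        }
        self.counts.clear()
        self.durations.clear()
        return summary
```

Note that once flushed, only the summary survives; the individual measurements behind it are gone.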
Tracing captures the entire lifespan of a request as it propagates through the services of a distributed system. A trace is a list of causally related spans that represent the execution flow of a request in a system. A span represents an interval of time that maps to a logical operation or work unit and contains a bag of key-value pairs (see Figure 21.2).
Figure 21.2: An execution flow can be represented with spans.
Traces allow developers to:
When a request begins, it's assigned a unique trace id. The trace id is propagated from one stage to another at every fork in the local execution flow, from one thread to another, and from caller to callee in a network call (through HTTP headers, for example). Each stage is represented with a span: an event containing the trace id.
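A sketch of span creation and context propagation. The header names (`X-Trace-Id`, `X-Parent-Span-Id`) are made up for illustration; real systems typically use a standard format such as W3C Trace Context.

```python
import time
import uuid

def start_trace():
    """Assign a unique trace id when a request begins."""
    return uuid.uuid4().hex

def make_span(trace_id, operation, parent_id=None):
    """A span is just an event carrying the trace id, timing, and key-value pairs."""
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "operation": operation,
        "start": time.time(),
    }

def outgoing_headers(span):
    """Propagate the trace context to a callee over a network call."""
    return {"X-Trace-Id": span["trace_id"], "X-Parent-Span-Id": span["span_id"]}
```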
When a span ends, it's emitted to a collector service, which assembles it into a trace by stitching it together with the other spans belonging to the same trace. Popular distributed tracing collectors include OpenZipkin and AWS X-Ray.
Tracing is challenging to retrofit into an existing system, as it requires every component in the request path to be modified to propagate the trace context from one stage to the next. And it's not just the components under your control that need to support tracing; the frameworks and open-source libraries you use need to support it as well, just like third-party services¹.
The main drawback of event logs is that they are fine-grained and service-specific.
When a single user request flows through a system, it can pass through several services. A specific event only contains information about the work unit of one specific service, so it isn't much use for debugging the entire request flow. Similarly, a single event doesn't tell you much about the health or state of a specific service.
This is where metrics and traces come in. You can think of them as abstractions, or derived views, built from event logs and tuned to specific use cases. A metric is a time series of summary statistics derived by aggregating counters or observations over multiple work units or events. You could emit counters in events and have the backend roll them up into metrics as they are ingested. In fact, this is how some metrics collection systems work.
Similarly, a trace can be derived by aggregating all events belonging to the lifecycle of a specific user request into an ordered list. Just like in the previous case, you can emit individual span events and have the backend aggregate them together into traces.
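The backend-side assembly can be sketched as grouping span events by trace id and ordering them by start time; assuming, for illustration, that each span event carries a `trace_id` and a `start` timestamp.

```python
from collections import defaultdict

def assemble_traces(span_events):
    """Group span events by trace id and order each trace by start time."""
    traces = defaultdict(list)
    for span in span_events:
        traces[span["trace_id"]].append(span)
    return {
        trace_id: sorted(spans, key=lambda s: s["start"])
        for trace_id, spans in traces.items()
    }
```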
Congratulations, you have reached the end of the book! I hope you learned something you didn't know before and perhaps even had a few "aha" moments. Although this is the end of the book, it's just the beginning of your journey. One of the best ways to learn how to design large-scale systems is by standing on the shoulders of giants.
Industry papers provide a wealth of knowledge about distributed systems that have stood the test of time. My recommendation is to start with "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency," which describes Azure's cloud storage system¹. Azure's cloud storage is the core building block on top of which Microsoft built many other successful products, and you will see many of the concepts introduced in this book there. One of its key design decisions was to guarantee strong consistency, unlike AWS S3², making the application developers' job much easier.
Once you have digested that, I suggest reading "Azure Data Explorer: a big data analytics cloud platform optimized for interactive, ad-hoc queries over structured, semi-structured and unstructured data." The paper discusses the implementation of a cloud-native event store built on top of Azure's cloud storage, a great example of how these large-scale systems compose on top of each other³.
Finally, if you are preparing for the system design interview, check out Alex Xu’s book “System Design Interview.” The book introduces a framework to tackle design interviews and includes more than 10 case studies.
¹ Think of it as Azure's equivalent of AWS S3, which unfortunately doesn't have a public paper.
² S3 has supported strong consistency since December 2020, though.
³ I worked on a time-series data store built on top of Azure Data Explorer and Azure Storage; unfortunately, no public paper is available for it just yet.